Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

Gharavi, Erfaneh; Veisi, Hadi; Rosso, Paolo

doi:10.1007/s00521-019-04594-y

Files in this item

Name: Gharavi;Veisi;Rosso ...
Size: 748.6Kb
Format: PDF
Description: Author's version.

Name: Gharavi_et_al_NCA.pdf
Size: 620.7Kb
Format: PDF
Description: Publisher's version

dc.contributor.author	Gharavi, Erfaneh	es_ES
dc.contributor.author	Veisi, Hadi	es_ES
dc.contributor.author	Rosso, Paolo	es_ES
dc.date.accessioned	2021-01-26T04:31:57Z
dc.date.available	2021-01-26T04:31:57Z
dc.date.issued	2020-07	es_ES
dc.identifier.issn	0941-0643	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/159837
dc.description.abstract	[EN] The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available on the Internet in several languages. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to different types of obfuscation. In this paper, we employ text embedding vectors to compare the similarity of documents in order to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show that performance improves when the obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their degree of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied our proposed method to available English, Persian and Arabic datasets on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach achieves considerable performance on different datasets in various languages. Our online threshold tuning approach, which uses no training dataset, works as well as, or in some cases even better than, the training-based method.	es_ES
dc.description.sponsorship	The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).	es_ES
dc.language	English	es_ES
dc.publisher	Springer-Verlag	es_ES
dc.relation.ispartof	Neural Computing and Applications	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Text alignment	es_ES
dc.subject	Language-independent plagiarism detection	es_ES
dc.subject	Word embedding	es_ES
dc.subject	Text representation	es_ES
dc.subject	Obfuscation type	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1007/s00521-019-04594-y	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PGC2018-096212-B-C31/ES/DESINFORMACION Y AGRESIVIDAD EN SOCIAL MEDIA: AGREGANDO INFORMACION Y ANALIZANDO EL LENGUAJE/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	https://doi.org/10.1007/s00521-019-04594-y	es_ES
dc.description.upvformatpinicio	10593	es_ES
dc.description.upvformatpfin	10607	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	32	es_ES
dc.description.issue	14	es_ES
dc.relation.pasarela	S\409341	es_ES
dc.contributor.funder	Agencia Estatal de Investigación	es_ES
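
The pipeline outlined in the abstract (aggregate word vectors into sentence vectors, pick the most similar source sentence as a seed, then filter seeds) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy embedding values, the function names, the use of a plain average as the aggregation function, cosine similarity as the comparison, and the "mean minus k standard deviations" outlier rule are all hypothetical choices made here. The abstract only says that a simple aggregation function, Jaccard similarity, and two statistical outlier filters are used, without giving their exact form.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "feline": [0.85, 0.15, 0.05],
    "sat": [0.1, 0.8, 0.1],
    "rested": [0.15, 0.75, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.2, 0.8],
}


def sentence_vector(tokens):
    """Average the vectors of known tokens (assumed aggregation function)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_seeds(suspicious, sources, min_jaccard=0.0):
    """For each suspicious sentence, keep its most similar source sentence.

    Returns (suspicious_index, source_index, similarity) tuples. Seeds whose
    token overlap is below min_jaccard are discarded (assumed filter).
    """
    seeds = []
    for i, sus in enumerate(suspicious):
        sus_vec = sentence_vector(sus)
        if sus_vec is None:
            continue
        best = None
        for j, src in enumerate(sources):
            src_vec = sentence_vector(src)
            if src_vec is None:
                continue
            sim = cosine(sus_vec, src_vec)
            if best is None or sim > best[2]:
                best = (i, j, sim)
        if best is not None and jaccard(sus, sources[best[1]]) >= min_jaccard:
            seeds.append(best)
    return seeds


def filter_outliers(seeds, k=1.0):
    """Assumed online rule: drop seeds scoring below mean - k * std."""
    if not seeds:
        return []
    sims = [s[2] for s in seeds]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((x - mean) ** 2 for x in sims) / len(sims))
    return [s for s in seeds if s[2] >= mean - k * std]


suspicious = [["feline", "rested"], ["stocks", "fell"]]
sources = [["cat", "sat"], ["stocks", "fell"]]
seeds = find_seeds(suspicious, sources)
```

In this toy setup the paraphrased sentence ("feline rested") is matched to "cat sat" through the embeddings even though the two share no tokens, which is why a strict Jaccard threshold would discard it. Because the outlier threshold in `filter_outliers` is computed from the candidate scores themselves, it needs no separate training corpus, which is the property the abstract attributes to online tuning.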

This item appears in the following collection(s)

Articles, conferences, monographs [48357]

RiuNet: Institutional Repository of the Universitat Politècnica de València