
Paraphrase Plagiarism Identification with Character-level Features

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

dc.contributor.author Sánchez-Vega, Fernando es_ES
dc.contributor.author Villatoro-Tello, Esaú es_ES
dc.contributor.author Montes-y-Gómez, Manuel es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.contributor.author Stamatatos, Efstathios es_ES
dc.contributor.author Villaseñor-Pineda, Luis es_ES
dc.date.accessioned 2021-01-27T04:32:44Z
dc.date.available 2021-01-27T04:32:44Z
dc.date.issued 2019-05 es_ES
dc.identifier.issn 1433-7541 es_ES
dc.identifier.uri http://hdl.handle.net/10251/159992
dc.description.abstract [EN] Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identification consists in automatically recognizing document fragments that contain re-used text, which is intentionally hidden by means of some rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions. Our main hypothesis establishes that the original author's writing style fingerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identification, which represents a new valuable resource to the NLP community for future research work in this field. es_ES
dc.description.sponsorship This work is the result of the collaboration in the framework of the CONACYT Thematic Networks program (RedTTL Language Technologies Network) and the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action. The first author was supported by CONACYT (Scholarship 258345/224483). The second, third, and sixth authors were partially supported by CONACyT (Project Grants 258588 and 2410). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the Grant ALMAMATER (PrometeoII/2014/030). es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Pattern Analysis and Applications es_ES
dc.rights All rights reserved es_ES
dc.subject Plagiarism identification es_ES
dc.subject Paraphrase plagiarism es_ES
dc.subject Text reuse es_ES
dc.subject Character n-grams es_ES
dc.subject Stylistic representation es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Paraphrase Plagiarism Identification with Character-level Features es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10044-017-0674-z es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CONACyT//FC 2016-2410/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/269180/EU/Web Information Quality Evaluation Initiative/
dc.relation.projectID info:eu-repo/grantAgreement/CONACyT//258588/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2014%2F030/ES/ Adaptive learning and multimodality in machine translation and text transcription/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CONACyT//258345%2F224483/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2015-71147-C2-1-P/ES/COMPRENSION DEL LENGUAJE EN LOS MEDIOS DE COMUNICACION SOCIAL - REPRESENTANDO CONTEXTOS DE FORMA CONTINUA/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Sánchez-Vega, F.; Villatoro-Tello, E.; Montes-y-Gómez, M.; Rosso, P.; Stamatatos, E.; Villaseñor-Pineda, L. (2019). Paraphrase Plagiarism Identification with Character-level Features. Pattern Analysis and Applications. 22(2):669-681. https://doi.org/10.1007/s10044-017-0674-z es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10044-017-0674-z es_ES
dc.description.upvformatpinicio 669 es_ES
dc.description.upvformatpfin 681 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 22 es_ES
dc.description.issue 2 es_ES
dc.relation.pasarela S\409334 es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.contributor.funder European Commission es_ES
dc.contributor.funder Consejo Nacional de Ciencia y Tecnología, México es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES

