
On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Bensalem, Imene es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.contributor.author Chikhi, Salim es_ES
dc.date.accessioned 2021-01-15T04:31:52Z
dc.date.available 2021-01-15T04:31:52Z
dc.date.issued 2019-09 es_ES
dc.identifier.issn 1574-020X es_ES
dc.identifier.uri http://hdl.handle.net/10251/159151
dc.description.abstract [EN] When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document without comparing it to textual resources that may serve as sources for the plagiarist. Character n-grams are recognised as a successful approach to modelling text for writing style analysis. Although prior studies have investigated the best practice of using character n-grams in authorship attribution and other problems, there is still a need for such investigations in the context of intrinsic plagiarism detection. Moreover, it has been assumed in previous works that the ways of using character n-grams in authorship attribution remain the same for intrinsic plagiarism detection. In this paper, we study the effect of character n-grams frequency and length on the performance of intrinsic plagiarism detection. Our experiments utilise two state-of-the-art methods and five large document collections of PAN labs written in English and Arabic. We demonstrate empirically that the low- and the high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, but their performance depends on the way they are exploited. es_ES
dc.description.sponsorship We are very grateful to the anonymous reviewers for their insightful suggestions and constructive comments that greatly improved the paper. This work has been partially supported by the Ecole Superieure de Comptabilite et de Finances de Constantine. The work of Paolo Rosso has been partially funded by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER). The work of Salim Chikhi has been partially funded by CNEPRU/DGRSDT/B*07120140018 research project. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Language Resources and Evaluation es_ES
dc.rights All rights reserved es_ES
dc.subject Intrinsic plagiarism detection es_ES
dc.subject Character n-grams es_ES
dc.subject Stylistic features es_ES
dc.subject Writing style analysis es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10579-019-09444-w es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MESRS//B*07120140018/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2015-71147-C2-1-P/ES/COMPRENSION DEL LENGUAJE EN LOS MEDIOS DE COMUNICACION SOCIAL - REPRESENTANDO CONTEXTOS DE FORMA CONTINUA/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Bensalem, I.; Rosso, P.; Chikhi, S. (2019). On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism. Language Resources and Evaluation. 53(3):363-396. https://doi.org/10.1007/s10579-019-09444-w es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10579-019-09444-w es_ES
dc.description.upvformatpinicio 363 es_ES
dc.description.upvformatpfin 396 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 53 es_ES
dc.description.issue 3 es_ES
dc.relation.pasarela S\409339 es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Ministère de l'Enseignement Supérieur et de la Recherche Scientifique, Tunisia es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES

