On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism

RiuNet: Institutional repository of the Polytechnic University of Valencia

Bensalem, I.; Rosso, P.; Chikhi, S. (2019). On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism. Language Resources and Evaluation. 53(3):363-396. https://doi.org/10.1007/s10579-019-09444-w

Please use this identifier to cite or link to this item: http://hdl.handle.net/10251/159151

Item Metadata

Title: On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism
Author: Bensalem, Imene; Rosso, Paolo; Chikhi, Salim
UPV Unit: Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
Issued date:
Abstract:
When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing ...
Subjects: Intrinsic plagiarism detection, Character n-grams, Stylistic features, Writing style analysis
Copyright: All rights reserved
Source: Language Resources and Evaluation (ISSN: 1574-020X)
DOI: 10.1007/s10579-019-09444-w
Publisher: Springer-Verlag
Publisher version: https://doi.org/10.1007/s10579-019-09444-w
Project ID: MESRS/CNEPRU/DGRSDT/B*07120140018; MINISTERIO DE ECONOMIA Y EMPRESA/TIN2015-71147-C2-1-P
Thanks:
We are very grateful to the anonymous reviewers for their insightful suggestions and constructive comments that greatly improved the paper. This work has been partially supported by the Ecole Superieure de Comptabilite et ...
Type: Article

References

Akiva, N. (2012). Authorship and Plagiarism Detection Using Binary BOW Features. In CLEF 2012 evaluation labs and workshop—working notes papers, 17–20 September, Rome, Italy.

Akiva, N., & Koppel, M. (2013). A generic unsupervised method for decomposing multi-author documents. Journal of the American Society for Information Science and Technology, 64(11), 2256–2264. https://doi.org/10.1002/asi.22924

Aldebei, K., He, X., Jia, W., & Yang, J. (2016). Unsupervised multi-author document decomposition based on hidden Markov model. In Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016) (pp. 706–714).

Aldebei, K., He, X., & Yang, J. (2015). Unsupervised decomposition of a multi-author document based on Naive-Bayesian Model. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers) (pp. 501–505).

Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech and Language Processing, 20(2), 356–370.

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S. (2015). Overview of the AraPlagDet PAN@FIRE2015 Shared task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mitra (Eds.), Post proceedings of the workshops at the 7th forum for information retrieval evaluation (FIRE 2015), Gandhinagar, India (pp. 111–122). CEUR-WS.org.

Bensalem, I., Rosso, P., & Chikhi, S. (2013a). A new corpus for the evaluation of Arabic intrinsic plagiarism detection. In P. Forner, H. Müller, R. Paredes, P. Rosso, & B. Stein (Eds.), CLEF 2013, LNCS, vol. 8138 (pp. 53–58). Heidelberg: Springer. https://doi.org/10.1007/978-3-642-40802-1_6

Bensalem, I., Rosso, P., & Chikhi, S. (2013b). Building Arabic corpora from Wikisource. In 2013 ACS international conference on computer systems and applications (AICCSA), Fes/Ifran, Morocco (pp. 1–2). IEEE. https://doi.org/10.1109/aiccsa.2013.6616474

Bensalem, I., Rosso, P., & Chikhi, S. (2014). Intrinsic plagiarism detection using n-gram classes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, October 25–29 (pp. 1459–1464). Association for Computational Linguistics.

Brocardo, M. L., Traore, I., Saad, S., & Woungang, I. (2013). Authorship verification for short messages using stylometry. In 2013 International conference on computer, information and telecommunication systems (CITS 2013) (pp. 1–6). IEEE. https://doi.org/10.1109/cits.2013.6705711

Brooke, J., & Hirst, G. (2012). Paragraph clustering for intrinsic plagiarism detection using a stylistic vector-space model with extrinsic features—Notebook for PAN at CLEF 2012. In CLEF 2012 Evaluation labs and workshop—Working notes papers, 17-20 September, Rome, Italy.

Burn-Thornton, K., & Burman, T. (2015). A novel approach for analysis of ‘real world’ data: A data mining engine for identification of multi-author student document submission. In M. Abou-Nasr, S. Lessmann, R. Stahlbock, & G. M. Weiss (Eds.), Real world data mining applications (Vol. 17, pp. 203–219). Springer International Publishing. https://doi.org/10.1007/978-3-319-07812-0_11

Giannella, C. (2016). An improved algorithm for unsupervised decomposition of a multi author document. Journal of the Association for Information Science and Technology, 67(2), 400–411.

Gillam, L., Marinuzzi, J., & Ioannou, P. (2011). TurnItOff-defeating plagiarism detection systems. In Proceedings of the 11th higher education academy-ics annual conference. Higher Education Academy.

Gipp, B., Meuschke, N., & Beel, J. (2011). Comparative evaluation of text- and citation-based plagiarism detection approaches using GuttenPlag. In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries (pp. 255–258).

Glover, A., & Hirst, G. (1996). Detecting stylistic inconsistencies in collaborative writing. In M. Sharples & T. van der Geest (Eds.), The new writing environment (pp. 147–168). London: Springer. https://doi.org/10.1007/978-1-4471-1482-6_12

Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting documents by stylistic character. Natural Language Engineering, 11(04), 397–415. https://doi.org/10.1017/S1351324905003694

Grozea, C., & Popescu, M. (2010). Who’s the thief? Automatic detection of the direction of plagiarism. In CICLing 2010, Iaşi, Romania, March 21–27, LNCS, vol. 6008 (pp. 700–710). Springer, Berlin. https://doi.org/10.1007/978-3-642-12116-6_59

Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In IJCAI international joint conference on artificial intelligence (pp. 1624–1628). Morgan Kaufmann Publishers, Burlington.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278

Heather, J. (2010). Turnitoff: Identifying and fixing a hole in current plagiarism detection software. Assessment & Evaluation in Higher Education, 35(6), 647–660. https://doi.org/10.1080/02602938.2010.486471

Houvardas, J., & Stamatatos, E. (2006). N-gram feature selection for authorship identification. In International conference on artificial intelligence: Methodology, systems, and applications (pp. 77–86).

Jankowska, M., Milios, E., & Kešelj, V. (2014). Author verification using common n-gram profiles of text documents. In Proceedings of COLING 2014, the 25th international conference on computational linguistics (pp. 387–397).

Kasprzak, J., & Brandejs, M. (2010). Improving the reliability of the plagiarism detection system lab report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

Keogh, E., Chu, S., Hart, D., & Pazzani, M. (2004). Segmenting time series: A survey and novel approach. In H. Bunke (Ed.), Data mining in time series databases (pp. 1–15). Singapore: World Scientific Publishing.

Kern, R., & Granitzer, M. (2009). Efficient linear text segmentation based on information retrieval techniques. In Proceedings of the international conference on management of emergent digital ecosystems—MEDES’09. ACM Press. https://doi.org/10.1145/1643823.1643854

Kern, R., Klampfl, S., & Zechner, M. (2012). Vote/veto classification, ensemble clustering and sequence classification for author identification—Notebook of PAN at CLEF 2012. Working notes papers of the CLEF 2012 evaluation labs (pp. 1–15).

Kešelj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the conference pacific association for computational linguistics, PACLING’03 (pp. 255–264).

Kestemont, M., Luyckx, K., & Daelemans, W. (2011). Intrinsic Plagiarism detection using character trigram distance scores—Notebook for PAN at CLEF 2011. In Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands.

Koppel, M., Akiva, N., Dershowitz, I., & Dershowitz, N. (2011). Unsupervised decomposition of a document into authorial components. In Proceedings of the 49th annual meeting of the association for computational linguistics (pp. 1356–1364). Association for Computational Linguistics.

Kuta, M., & Kitowski, J. (2014). Optimisation of character n-gram profiles method for intrinsic plagiarism detection. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), ICAISC 2014, Part II, LNAI, vol. 8468 (pp. 500–511). Springer. https://doi.org/10.1007/978-3-319-07176-3_44

Kuznetsov, M., Motrenko, A., Kuznetsova, R., & Strijov, V. (2016). Methods for intrinsic plagiarism detection and author diarization Notebook for PAN at CLEF 2016. In Working notes of CLEF 2016—Conference and labs of the evaluation forum Évora, Portugal, 5–8 September, 2016 (pp. 912–919). CEUR-WS.org.

Mahgoub, A. Y., Magooda, A., Rashwan, M., Fayek, M. B., & Raafat, H. (2015). RDI system for intrinsic plagiarism detection (RDI_RID) Working notes for PAN-AraPlagDet at FIRE 2015. In Workshops proceedings of the seventh international forum for information retrieval evaluation (FIRE 2015), Gandhinagar, India (pp. 129–130). CEUR-WS.org.

Meyer zu Eißen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H.-J. Lenz (Eds.), Advances in data analysis, selected papers from the 30th annual conference of the German Classification Society (GfKl), Berlin (pp. 359–366). Heidelberg: Springer. https://doi.org/10.1007/978-3-540-70981-7_40

Muhr, M., Kern, R., Zechner, M., & Granitzer, M. (2010). External and intrinsic plagiarism detection using a cross-lingual retrieval and segmentation system—Lab report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and Workshops, September 22–23, Padua, Italy.

Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756–3763. https://doi.org/10.1016/j.eswa.2012.12.082

Pertile, S. D. L., Moreira, V. P., & Rosso, P. (2015). Comparing and combining content- and citation-based approaches for plagiarism detection. Journal of the Association for Information Science and Technology, 67(10), 2511–2526. https://doi.org/10.1002/asi.23593

Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., & Rosso, P. (2010). Overview of the 2nd international competition on plagiarism detection. In M. Braschler & D. Harman (Eds.), Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2011). Overview of the 3rd international competition on plagiarism detection. In V. Petras, P. Forner, & P. Clough (Eds.), Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands.

Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In C.-R. Huang & D. Jurafsky (Eds.), Proceedings of the 23rd international conference on computational linguistics (COLING 2010) (pp. 997–1005). Stroudsburg, USA: Association for Computational Linguistics.

Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 1–9). CEUR-WS.org.

Rao, S., Gupta, P., Singhal, K., & Majumder, P. (2011). External & intrinsic plagiarism detection: VSM & discourse markers based approach—Notebook for PAN at CLEF 2011. In Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands (pp. 2–6).

Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M., & Stein, B. (2016). Overview of PAN’16: New challenges for authorship analysis: Cross-genre profiling, clustering, diarization, and obfuscation. In N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, et al. (Eds.), CLEF 2016, LNCS 9822 (pp. 332–350). Springer. https://doi.org/10.1007/978-3-319-44564-9_28

Sapkota, U., Bethard, S., Montes-y-Gómez, M., & Solorio, T. (2015). Not all character n-grams are created equal: A study in authorship attribution. In 2015 conference of the North American chapter of the association for computational linguistics—Human Language Technologies (NAACL HLT 2015) (pp. 93–102). https://doi.org/10.3115/v1/n15-1010

Shrestha, P., & Solorio, T. (2015). Identification of original document by using textual similarities. In A. Gelbukh (Ed.), CICLing 2015, Part II, LNCS 9042 (pp. 643–654). Springer. https://doi.org/10.1007/978-3-319-18117-2_48

Stamatatos, E. (2009a). Intrinsic plagiarism detection using character n-gram profiles. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 38–46). CEUR-WS.org.

Stamatatos, E. (2009b). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538–556. https://doi.org/10.1002/asi.21001

Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law & Policy, 21(2), 421–439.

Stamatatos, E. (2016). Universality of stylistic traits in texts. In M. D. Esposti, E. G. Altmann, & F. Pachet (Eds.), Creativity and universality in language (pp. 143–155). Springer. https://doi.org/10.1007/978-3-319-24403-7_9

Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2016). Clustering by authorship within and across documents. In Working notes of CLEF 2016—Conference and labs of the evaluation forum Évora, Portugal, 5–8 September, 2016 (pp. 691–715). CEUR-WS.org.

Stein, B., Lipka, N., & Prettenhofer, P. (2011). Intrinsic plagiarism analysis. Language Resources and Evaluation, 45(1), 63–82. https://doi.org/10.1007/s10579-010-9115-y

Suárez, P., González, J. C., & Villena-Román, J. (2010). A plagiarism detector for intrinsic plagiarism—Lab Report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

Suchomel, Š., Kasprzak, J., & Brandejs, M. (2012). Three way search engine queries with multi-feature document comparison for plagiarism detection—Notebook for PAN at CLEF 2012. In CLEF 2012 evaluation labs and workshop—Working notes papers, 17–20 September, Rome, Italy.

Tschuggnall, M., & Specht, G. (2014). Automatic decomposition of multi-author documents using grammar analysis. In Proceedings of the 26th GI-workshop on foundations of databases (Grundlagen von Datenbanken) (pp. 17–22). CEUR-WS.org.

Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2017). Overview of the author identification task at PAN-2017: Style breach detection and author clustering. In L. Cappellato, N. Ferro, L. Goeuriot, & T. Mandl (Eds.), Working notes papers of the CLEF 2017 evaluation labs volume 1866 of CEUR workshop proceedings, September 2017. CLEF and CEUR-WS.org.

van Halteren, H. (2003). Detection of plagiarism in student essays. In Computational linguistics in the Netherlands 2003: Selected papers from the fourteenth CLIN meeting (pp. 157–169).

van Halteren, H. (2004). Linguistic profiling for author recognition and verification. In Proceedings of the 42nd annual meeting on association for computational linguistics (Article No. 199). Association for Computational Linguistics. https://doi.org/10.3115/1218955.1218981

Zečević, A. (2011). N-gram based text classification according to authorship. In Proceedings of the student research workshop associated with RANLP 2011 (pp. 145–149). Hissar, Bulgaria: Association for Computational Linguistics.

Zechner, M., Muhr, M., Kern, R., & Granitzer, M. (2009). External and intrinsic plagiarism detection using vector space models. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 47–55). CEUR-WS.org.
