- -

Cross-Language Plagiarism Detection

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Compartir/Enviar a

Citas

Estadísticas

  • Estadisticas de Uso

Cross-Language Plagiarism Detection

Mostrar el registro sencillo del ítem

Ficheros en el ítem

dc.contributor.author Potthast, Martin es_ES
dc.contributor.author Barrón Cedeño, Luis Alberto es_ES
dc.contributor.author Stein, Benno es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.date.accessioned 2014-05-14T12:32:31Z
dc.date.issued 2011-03
dc.identifier.issn 1574-020X
dc.identifier.uri http://hdl.handle.net/10251/37479
dc.description.abstract Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well. es_ES
dc.description.sponsorship This work was partially supported by the TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 project and the CONACyT-Mexico 192021 grant. en_EN
dc.format.extent 18 es_ES
dc.language Inglés es_ES
dc.publisher Springer Verlag (Germany) es_ES
dc.relation.ispartof Language Resources and Evaluation es_ES
dc.rights Reserva de todos los derechos es_ES
dc.subject Cross-language es_ES
dc.subject Plagiarism detection es_ES
dc.subject Similarity es_ES
dc.subject Retrieval model es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Cross-Language Plagiarism Detection es_ES
dc.type Artículo es_ES
dc.embargo.lift 10000-01-01
dc.embargo.terms forever es_ES
dc.identifier.doi 10.1007/s10579-009-9114-z
dc.relation.projectID info:eu-repo/grantAgreement/MICINN//TIN2009-13391-C04-03/ES/Text-Enterprise 2.0: Tecnicas De Comprension De Textos Aplicadas A Las Necesidades De La Empresa 2.0/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CONACyT//192021/ es_ES
dc.rights.accessRights Abierto es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Potthast, M.; Barrón Cedeño, LA.; Stein, B.; Rosso, P. (2011). Cross-Language Plagiarism Detection. Language Resources and Evaluation. 45(1):45-62. https://doi.org/10.1007/s10579-009-9114-z es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://link.springer.com/article/10.1007/s10579-009-9114-z es_ES
dc.description.upvformatpinicio 45 es_ES
dc.description.upvformatpfin 62 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 45 es_ES
dc.description.issue 1 es_ES
dc.relation.senia 215389
dc.contributor.funder Ministerio de Ciencia e Innovación es_ES
dc.contributor.funder Consejo Nacional de Ciencia y Tecnología, México es_ES
dc.description.references Ballesteros, L. A. (2001). Resolving ambiguity for cross-language information retrieval: A dictionary approach. PhD thesis, University of Massachusetts Amherst, USA, Bruce Croft. es_ES
dc.description.references Barrón-Cedeño, A., Rosso, P., Pinto, D., & Juan A. (2008). On cross-lingual plagiarism analysis using a statistical model. In S. Benno, S. Efstathios, & K. Moshe (Eds.), ECAI 2008 workshop on uncovering plagiarism, authorship, and social software misuse (PAN 08) (pp. 9–13). Patras, Greece. es_ES
dc.description.references Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8. es_ES
dc.description.references Berger, A., & Lafferty, J. (1999). Information retrieval as statistical translation. In SIGIR’99: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (vol. 4629, pp. 222–229). Berkeley, California, United States: ACM. es_ES
dc.description.references Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95 (pp. 398–409). New York, NY, USA: ACM Press. es_ES
dc.description.references Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311. es_ES
dc.description.references Ceska, Z., Toman, M., & Jezek, K. (2008). Multilingual plagiarism detection. In AIMSA’08: Proceedings of the 13th international conference on artificial intelligence (pp. 83–92). Berlin, Heidelberg: Springer. es_ES
dc.description.references Clough, P. (2003). Old and new challenges in automatic plagiarism detection. National UK Plagiarism Advisory Service, http://www.ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf . es_ES
dc.description.references Dempster A. P., Laird N. M., Rubin D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38. es_ES
dc.description.references Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic cross-language retrieval using latent semantic indexing. In D. Hull & D. Oard (Eds.), AAAI-97 spring symposium series: Cross-language text and speech retrieval (pp. 18–24). Stanford University, American Association for Artificial Intelligence. es_ES
dc.description.references Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference for artificial intelligence, Hyderabad, India. es_ES
dc.description.references Hoad T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarised documents. American Society for Information Science and Technology, 54(3), 203–215. es_ES
dc.description.references Levow, G.-A., Oard, D. W., & Resnik, P. (2005). Dictionary-based techniques for cross-language information retrieval. Information Processing & Management, 41(3), 523–547. es_ES
dc.description.references Littman, M., Dumais, S. T., & Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. In Cross-language information retrieval, chap. 5 (pp. 51–62). Kluwer. es_ES
dc.description.references Maurer, H., Kappe, F., & Zaka, B. (2006). Plagiarism—a survey. Journal of Universal Computer Science, 12(8), 1050–1084. es_ES
dc.description.references McCabe, D. (2005). Research report of the Center for Academic Integrity. http://www.academicintegrity.org . es_ES
dc.description.references Mcnamee, P., & Mayfield, J. (2004). Character N-gram tokenization for European language text retrieval. Information Retrieval, 7(1–2), 73–97. es_ES
dc.description.references Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic plagiarism detection. In M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Proceedings of the European conference on information retrieval (ECIR 2006), volume 3936 of Lecture Notes in Computer Science (pp. 565–569). Springer. es_ES
dc.description.references Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. J. Lenz (Eds.), Advances in data analysis (pp. 359–366), Springer. es_ES
dc.description.references Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51. es_ES
dc.description.references Pinto, D., Juan, A., & Rosso, P. (2007). Using query-relevant documents pairs for cross-lingual information retrieval. In V. Matousek & P. Mautner (Eds.), Lecture Notes in Artificial Intelligence (pp. 630–637). Pilsen, Czech Republic. es_ES
dc.description.references Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to cross-lingual natural language tasks. Journal of Algorithms, 64(1), 51–60. es_ES
dc.description.references Potthast, M. (2007). Wikipedia in the pocket-indexing technology for near-duplicate detection and high similarity search. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 909–909). ACM. es_ES
dc.description.references Potthast, M., Stein, B., & Anderka, M. (2008). A Wikipedia-based multilingual retrieval model. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, & R. W. White (Eds.), 30th European conference on IR research, ECIR 2008, Glasgow , volume 4956 LNCS of Lecture Notes in Computer Science (pp. 522–530). Berlin: Springer. es_ES
dc.description.references Pouliquen, B., Steinberger, R., & Ignat, C. (2003a). Automatic annotation of multilingual text collections with a conceptual thesaurus. In Proceedings of the workshop ’ontologies and information extraction’ at the Summer School ’The Semantic Web and Language Technology—its potential and practicalities’ (EUROLAN’2003) (pp. 9–28), Bucharest, Romania. es_ES
dc.description.references Pouliquen, B., Steinberger, R., & Ignat, C. (2003b). Automatic identification of document translations in large multilingual document collections. In Proceedings of the international conference recent advances in natural language processing (RANLP’2003) (pp. 401–408). Borovets, Bulgaria. es_ES
dc.description.references Stein, B. (2007). Principles of hash-based text retrieval. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 527–534). ACM. es_ES
dc.description.references Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In K. Tochtermann & H. Maurer (Eds.), Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science. (pp. 572–579). Know-Center. es_ES
dc.description.references Stein, B., & Anderka, M. (2009). Collection-relative representations: A unifying view to retrieval models. In A. M. Tjoa & R. R. Wagner (Eds.), 20th International conference on database and expert systems applications (DEXA 09) (pp. 383–387). IEEE. es_ES
dc.description.references Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In B. Stein, M. Koppel, & E. Stamatatos (Eds.), SIGIR workshop on plagiarism analysis, authorship identification, and near-duplicate detection (PAN 07) (pp. 45–50). CEUR-WS.org. es_ES
dc.description.references Stein, B., & Potthast, M. (2007). Construction of compact retrieval models. In S. Dominich & F. Kiss (Eds.), Studies in theory of information retrieval (pp. 85–93). Foundation for Information Society. es_ES
dc.description.references Stein, B., Meyer zu Eissen, S., & Potthast, M. (2007). Strategies for retrieving plagiarized documents. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 825–826). ACM. es_ES
dc.description.references Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th international conference on language resources and evaluation (LREC’2006). es_ES
dc.description.references Steinberger, R., Pouliquen, B., & Ignat, C. (2004). Exploiting multilingual nomenclatures and language-independent text features as an interlingua for cross-lingual text analysis applications. In Proceedings of the 4th Slovenian language technology conference. Information Society 2004 (IS’2004). es_ES
dc.description.references Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2003). Inferring a semantic representation of text via cross-language correlation analysis. In S. Becker, S. Thrun, & K. Obermayer (Eds.), NIPS-02: Advances in neural information processing systems (pp. 1473–1480). MIT Press. es_ES
dc.description.references Yang, Y., Carbonell, J. G., Brown, R. D., & Frederking, R. E. (1998). Translingual information retrieval: Learning from bilingual corpora. Artificial Intelligence, 103(1–2), 323–345. es_ES


Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem