Cross-Language Plagiarism Detection

Potthast, Martin; Barrón Cedeño, Luis Alberto; Stein, Benno; Rosso, Paolo

doi:10.1007/s10579-009-9114-z

Identificarse

Buscar en RiuNet

Listar

Todo RiuNet
Esta colección

Mi cuenta

Acceder

Estadísticas

Ver Estadísticas de uso

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Cross-Language Plagiarism Detection

Mostrar el registro sencillo del ítem

Ficheros en el ítem

Nombre: Cross-Language ...

Tamaño: 222.3Kb

Formato: PDF

Descripción: Versión del Autor.

Abrir

Nombre: Barrón;Rosso - ...

Tamaño: 514.0Kb

Formato: PDF

Descripción: Versión editorial

Solicitar una copia al autor

dc.contributor.author	Potthast, Martin	es_ES
dc.contributor.author	Barrón Cedeño, Luis Alberto	es_ES
dc.contributor.author	Stein, Benno	es_ES
dc.contributor.author	Rosso, Paolo	es_ES
dc.date.accessioned	2014-05-14T12:32:31Z
dc.date.issued	2011-03
dc.identifier.issn	1574-020X
dc.identifier.uri	http://hdl.handle.net/10251/37479
dc.description.abstract	Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.	es_ES
dc.description.sponsorship	This work was partially supported by the TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 project and the CONACyT-Mexico 192021 grant.	en_EN
dc.format.extent	18	es_ES
dc.language	Inglés	es_ES
dc.publisher	Springer Verlag (Germany)	es_ES
dc.relation.ispartof	Language Resources and Evaluation	es_ES
dc.rights	Reserva de todos los derechos	es_ES
dc.subject	Cross-language	es_ES
dc.subject	Plagiarism detection	es_ES
dc.subject	Similarity	es_ES
dc.subject	Retrieval model	es_ES
dc.subject.classification	LENGUAJES Y SISTEMAS INFORMATICOS	es_ES
dc.title	Cross-Language Plagiarism Detection	es_ES
dc.type	Artículo	es_ES
dc.embargo.lift	10000-01-01
dc.embargo.terms	forever	es_ES
dc.identifier.doi	10.1007/s10579-009-9114-z
dc.relation.projectID	info:eu-repo/grantAgreement/MICINN//TIN2009-13391-C04-03/ES/Text-Enterprise 2.0: Tecnicas De Comprension De Textos Aplicadas A Las Necesidades De La Empresa 2.0/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/CONACyT//192021/	es_ES
dc.rights.accessRights	Abierto	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Potthast, M.; Barrón Cedeño, LA.; Stein, B.; Rosso, P. (2011). Cross-Language Plagiarism Detection. Language Resources and Evaluation. 45(1):45-62. https://doi.org/10.1007/s10579-009-9114-z	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	http://link.springer.com/article/10.1007/s10579-009-9114-z	es_ES
dc.description.upvformatpinicio	45	es_ES
dc.description.upvformatpfin	62	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	45	es_ES
dc.description.issue	1	es_ES
dc.relation.senia	215389
dc.contributor.funder	Ministerio de Ciencia e Innovación	es_ES
dc.contributor.funder	Consejo Nacional de Ciencia y Tecnología, México	es_ES
dc.description.references	Ballesteros, L. A. (2001). Resolving ambiguity for cross-language information retrieval: A dictionary approach. PhD thesis, University of Massachusetts Amherst, USA, Bruce Croft.	es_ES
dc.description.references	Barrón-Cedeño, A., Rosso, P., Pinto, D., & Juan A. (2008). On cross-lingual plagiarism analysis using a statistical model. In S. Benno, S. Efstathios, & K. Moshe (Eds.), ECAI 2008 workshop on uncovering plagiarism, authorship, and social software misuse (PAN 08) (pp. 9–13). Patras, Greece.	es_ES
dc.description.references	Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8.	es_ES
dc.description.references	Berger, A., & Lafferty, J. (1999). Information retrieval as statistical translation. In SIGIR’99: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (vol. 4629, pp. 222–229). Berkeley, California, United States: ACM.	es_ES
dc.description.references	Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95 (pp. 398–409). New York, NY, USA: ACM Press.	es_ES
dc.description.references	Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.	es_ES
dc.description.references	Ceska, Z., Toman, M., & Jezek, K. (2008). Multilingual plagiarism detection. In AIMSA’08: Proceedings of the 13th international conference on artificial intelligence (pp. 83–92). Berlin, Heidelberg: Springer.	es_ES
dc.description.references	Clough, P. (2003). Old and new challenges in automatic plagiarism detection. National UK Plagiarism Advisory Service, http://www.ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf .	es_ES
dc.description.references	Dempster A. P., Laird N. M., Rubin D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.	es_ES
dc.description.references	Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic cross-language retrieval using latent semantic indexing. In D. Hull & D. Oard (Eds.), AAAI-97 spring symposium series: Cross-language text and speech retrieval (pp. 18–24). Stanford University, American Association for Artificial Intelligence.	es_ES
dc.description.references	Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference for artificial intelligence, Hyderabad, India.	es_ES
dc.description.references	Hoad T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarised documents. American Society for Information Science and Technology, 54(3), 203–215.	es_ES
dc.description.references	Levow, G.-A., Oard, D. W., & Resnik, P. (2005). Dictionary-based techniques for cross-language information retrieval. Information Processing & Management, 41(3), 523–547.	es_ES
dc.description.references	Littman, M., Dumais, S. T., & Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. In Cross-language information retrieval, chap. 5 (pp. 51–62). Kluwer.	es_ES
dc.description.references	Maurer, H., Kappe, F., & Zaka, B. (2006). Plagiarism—a survey. Journal of Universal Computer Science, 12(8), 1050–1084.	es_ES
dc.description.references	McCabe, D. (2005). Research report of the Center for Academic Integrity. http://www.academicintegrity.org .	es_ES
dc.description.references	Mcnamee, P., & Mayfield, J. (2004). Character N-gram tokenization for European language text retrieval. Information Retrieval, 7(1–2), 73–97.	es_ES
dc.description.references	Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic plagiarism detection. In M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Proceedings of the European conference on information retrieval (ECIR 2006), volume 3936 of Lecture Notes in Computer Science (pp. 565–569). Springer.	es_ES
dc.description.references	Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. J. Lenz (Eds.), Advances in data analysis (pp. 359–366), Springer.	es_ES
dc.description.references	Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.	es_ES
dc.description.references	Pinto, D., Juan, A., & Rosso, P. (2007). Using query-relevant documents pairs for cross-lingual information retrieval. In V. Matousek & P. Mautner (Eds.), Lecture Notes in Artificial Intelligence (pp. 630–637). Pilsen, Czech Republic.	es_ES
dc.description.references	Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to cross-lingual natural language tasks. Journal of Algorithms, 64(1), 51–60.	es_ES
dc.description.references	Potthast, M. (2007). Wikipedia in the pocket-indexing technology for near-duplicate detection and high similarity search. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 909–909). ACM.	es_ES
dc.description.references	Potthast, M., Stein, B., & Anderka, M. (2008). A Wikipedia-based multilingual retrieval model. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, & R. W. White (Eds.), 30th European conference on IR research, ECIR 2008, Glasgow , volume 4956 LNCS of Lecture Notes in Computer Science (pp. 522–530). Berlin: Springer.	es_ES
dc.description.references	Pouliquen, B., Steinberger, R., & Ignat, C. (2003a). Automatic annotation of multilingual text collections with a conceptual thesaurus. In Proceedings of the workshop ’ontologies and information extraction’ at the Summer School ’The Semantic Web and Language Technology—its potential and practicalities’ (EUROLAN’2003) (pp. 9–28), Bucharest, Romania.	es_ES
dc.description.references	Pouliquen, B., Steinberger, R., & Ignat, C. (2003b). Automatic identification of document translations in large multilingual document collections. In Proceedings of the international conference recent advances in natural language processing (RANLP’2003) (pp. 401–408). Borovets, Bulgaria.	es_ES
dc.description.references	Stein, B. (2007). Principles of hash-based text retrieval. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 527–534). ACM.	es_ES
dc.description.references	Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In K. Tochtermann & H. Maurer (Eds.), Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science. (pp. 572–579). Know-Center.	es_ES
dc.description.references	Stein, B., & Anderka, M. (2009). Collection-relative representations: A unifying view to retrieval models. In A. M. Tjoa & R. R. Wagner (Eds.), 20th International conference on database and expert systems applications (DEXA 09) (pp. 383–387). IEEE.	es_ES
dc.description.references	Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In B. Stein, M. Koppel, & E. Stamatatos (Eds.), SIGIR workshop on plagiarism analysis, authorship identification, and near-duplicate detection (PAN 07) (pp. 45–50). CEUR-WS.org.	es_ES
dc.description.references	Stein, B., & Potthast, M. (2007). Construction of compact retrieval models. In S. Dominich & F. Kiss (Eds.), Studies in theory of information retrieval (pp. 85–93). Foundation for Information Society.	es_ES
dc.description.references	Stein, B., Meyer zu Eissen, S., & Potthast, M. (2007). Strategies for retrieving plagiarized documents. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th Annual international ACM SIGIR conference (pp. 825–826). ACM.	es_ES
dc.description.references	Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th international conference on language resources and evaluation (LREC’2006).	es_ES
dc.description.references	Steinberger, R., Pouliquen, B., & Ignat, C. (2004). Exploiting multilingual nomenclatures and language-independent text features as an interlingua for cross-lingual text analysis applications. In Proceedings of the 4th Slovenian language technology conference. Information Society 2004 (IS’2004).	es_ES
dc.description.references	Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2003). Inferring a semantic representation of text via cross-language correlation analysis. In S. Becker, S. Thrun, & K. Obermayer (Eds.), NIPS-02: Advances in neural information processing systems (pp. 1473–1480). MIT Press.	es_ES
dc.description.references	Yang, Y., Carbonell, J. G., Brown, R. D., & Frederking, R. E. (1998). Translingual information retrieval: Learning from bilingual corpora. Artificial Intelligence, 103(1–2), 323–345.	es_ES

Este ítem aparece en la(s) siguiente(s) colección(ones)

Artículos, conferencias, monografías [47175]

Mostrar el registro sencillo del ítem

Cross-Language Plagiarism Detection

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Buscar en RiuNet

Listar

Todo RiuNet

Esta colección

Mi cuenta

Estadísticas

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Cross-Language Plagiarism Detection

Ficheros en el ítem

Este ítem aparece en la(s) siguiente(s) colección(ones)