A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

Barrón Cedeño, Luis Alberto; Paramita, Monica Lestari; Clough, Paul; Rosso, Paolo

doi:10.1007/978-3-319-06028-6_36

Identificarse

Buscar en RiuNet

Listar

Todo RiuNet
Esta colección

Mi cuenta

Acceder

Estadísticas

Ver Estadísticas de uso

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

Mostrar el registro sencillo del ítem

Ficheros en el ítem

Nombre: ecir-wikisim.pdf

Tamaño: 388.0Kb

Formato: PDF

Descripción: Versión del Autor.

Abrir

Nombre: ECIR-LNCS.pdf

Tamaño: 335.0Kb

Formato: PDF

Descripción: Versión editorial

Solicitar una copia al autor

dc.contributor.author	Barrón Cedeño, Luis Alberto	es_ES
dc.contributor.author	Paramita, Monica Lestari	es_ES
dc.contributor.author	Clough, Paul	es_ES
dc.contributor.author	Rosso, Paolo	es_ES
dc.date.accessioned	2015-05-05T12:32:56Z
dc.date.available	2015-05-05T12:32:56Z
dc.date.issued	2014
dc.identifier.isbn	978-3-319-06027-9
dc.identifier.issn	0302-9743
dc.identifier.uri	http://hdl.handle.net/10251/49698
dc.description.abstract	Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and CrossLanguage Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language-links. However, the extent to which these articles are similar is highly variable and this may impact on the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (char-ngrams, outlinks and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).	es_ES
dc.description.sponsorship	The work of the first author was in the framework of the Tacardi research project (TIN2012-38523-C02-00). The work of the fourth author was in the framework of the DIANA-Applications (TIN2012-38603-C02-01) and WIQ-EI IRSES (FP7 Marie Curie No. 269180) research projects.	es_ES
dc.language	Inglés	es_ES
dc.publisher	Springer Verlag (Germany)	es_ES
dc.relation.ispartof	Advances in Information Retrieval	es_ES
dc.relation.ispartofseries	Lecture Notes in Computer Science;8416
dc.rights	Reserva de todos los derechos	es_ES
dc.subject	Wikipedia	es_ES
dc.subject	Cross-Lingual Similarity	es_ES
dc.subject.classification	LENGUAJES Y SISTEMAS INFORMATICOS	es_ES
dc.title	A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles	es_ES
dc.type	Capítulo de libro	es_ES
dc.identifier.doi	10.1007/978-3-319-06028-6_36
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO//TIN2012-38523-C02/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/FP7/269180/EU/Web Information Quality Evaluation Initiative/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO//TIN2012-38603-C02-01/ES/DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS/	es_ES
dc.rights.accessRights	Abierto	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Barrón Cedeño, LA.; Paramita, ML.; Clough, P.; Rosso, P. (2014). A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles. En Advances in Information Retrieval. Springer Verlag (Germany). 424-429. https://doi.org/10.1007/978-3-319-06028-6_36	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	http://link.springer.com/chapter/10.1007/978-3-319-06028-6_36	es_ES
dc.description.upvformatpinicio	424	es_ES
dc.description.upvformatpfin	429	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.relation.senia	286233
dc.contributor.funder	Ministerio de Economía y Competitividad	es_ES
dc.contributor.funder	European Commission	es_ES
dc.description.references	Adafre, S., de Rijke, M.: Finding Similar Sentences across Multiple Languages in Wikipedia. In: Proc. of the 11th Conf. of the European Chapter of the Association for Computational Linguistics, pp. 62–69 (2006)	es_ES
dc.description.references	Dumais, S., Letsche, T., Littman, M., Landauer, T.: Automatic Cross-Language Retrieval Using Latent Semantic Indexing. In: AAAI 1997 Spring Symposium Series: Cross-Language Text and Speech Retrieval, Stanford University, pp. 24–26 (1997)	es_ES
dc.description.references	Filatova, E.: Directions for exploiting asymmetries in multilingual Wikipedia. In: Proc. of the Third Intl. Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, Boulder, CO (2009)	es_ES
dc.description.references	Levow, G.A., Oard, D., Resnik, P.: Dictionary-Based Techniques for Cross-Language Information Retrieval. Information Processing and Management: Special Issue on Cross-Language Information Retrieval 41(3), 523–547 (2005)	es_ES
dc.description.references	Mcnamee, P., Mayfield, J.: Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval 7(1-2), 73–97 (2004)	es_ES
dc.description.references	Mihalcea, R.: Using Wikipedia for Automatic Word Sense Disambiguation. In: Proc. of NAACL 2007. ACL, Rochester (2007)	es_ES
dc.description.references	Mohammadi, M., GhasemAghaee, N.: Building Bilingual Parallel Corpora based on Wikipedia. In: Second Intl. Conf. on Computer Engineering and Applications., vol. 2, pp. 264–268 (2010)	es_ES
dc.description.references	Munteanu, D., Fraser, A., Marcu, D.: Improved Machine Translation Performace via Parallel Sentence Extraction from Comparable Corpora. In: Proc. of the Human Language Technology and North American Association for Computational Linguistics Conf (HLT/NAACL 2004), Boston, MA (2004)	es_ES
dc.description.references	Nguyen, D., Overwijk, A., Hauff, C., Trieschnigg, D.R.B., Hiemstra, D., de Jong, F.: WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 58–65. Springer, Heidelberg (2009)	es_ES
dc.description.references	Paramita, M.L., Clough, P.D., Aker, A., Gaizauskas, R.: Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles. In: Calzolari, E.A. (ed.) Proc. of the 8th Intl. Language Resources and Evaluation (LREC 2012), pp. 790–797. ELRA, Istanbul (2012)	es_ES
dc.description.references	Potthast, M., Stein, B., Anderka, M.: A Wikipedia-Based Multilingual Retrieval Model. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 522–530. Springer, Heidelberg (2008)	es_ES
dc.description.references	Simard, M., Foster, G.F., Isabelle, P.: Using Cognates to Align Sentences in Bilingual Corpora. In: Proc. of the Fourth Intl. Conf. on Theoretical and Methodological Issues in Machine Translation (1992)	es_ES
dc.description.references	Steinberger, R., Pouliquen, B., Hagman, J.: Cross-lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 415–424. Springer, Heidelberg (2002)	es_ES
dc.description.references	Toral, A., Muñoz, R.: A proposal to automatically build and maintain gazetteers for Named Entity Recognition using Wikipedia. In: Proc. of the EACL Workshop on New Text 2006. Association for Computational Linguistics, Trento (2006)	es_ES

Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem

A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Buscar en RiuNet

Listar

Todo RiuNet

Esta colección

Mi cuenta

Estadísticas

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

Ficheros en el ítem

Este ítem aparece en la(s) siguiente(s) colección(ones)