
A resource-light method for cross-lingual semantic textual similarity

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia


dc.contributor.author Glavas, Goran es_ES
dc.contributor.author Franco-Salvador, Marc es_ES
dc.contributor.author Ponzetto, Simone Paolo es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.date.accessioned 2020-06-13T03:32:29Z
dc.date.available 2020-06-13T03:32:29Z
dc.date.issued 2018-03-01 es_ES
dc.identifier.issn 0950-7051 es_ES
dc.identifier.uri http://hdl.handle.net/10251/146277
dc.description.abstract [EN] Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that do not exist for many languages (or language pairs). In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a corpus large enough to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks. (C) 2017 Published by Elsevier B.V. es_ES
dc.description.sponsorship Part of the work presented in this article was performed during the second author's research visit to the University of Mannheim, supported by a Contact Fellowship awarded by the DAAD scholarship program "STIBET Doktoranden". The research of the last author has been carried out in the framework of the SomEMBED project (TIN2015-71147-C2-1-P). Furthermore, this work was partially funded by the Junior-professor funding programme of the Ministry of Science, Research and the Arts of the state of Baden-Württemberg (project "Deep semantic models for high-end NLP application"). es_ES
dc.language English es_ES
dc.publisher Elsevier es_ES
dc.relation.ispartof Knowledge-Based Systems es_ES
dc.rights All rights reserved es_ES
dc.subject Semantic textual similarity es_ES
dc.subject Cross-lingual Word embeddings es_ES
dc.subject Word alignment es_ES
dc.subject Parallel sentences alignment es_ES
dc.subject Plagiarism detection es_ES
dc.subject.classification Computer Languages and Systems (LENGUAJES Y SISTEMAS INFORMATICOS) es_ES
dc.title A resource-light method for cross-lingual semantic textual similarity es_ES
dc.type Article es_ES
dc.identifier.doi 10.1016/j.knosys.2017.11.041 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2015-71147-C2-1-P/ES/COMPRENSION DEL LENGUAJE EN LOS MEDIOS DE COMUNICACION SOCIAL - REPRESENTANDO CONTEXTOS DE FORMA CONTINUA/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Glavas, G.; Franco-Salvador, M.; Ponzetto, SP.; Rosso, P. (2018). A resource-light method for cross-lingual semantic textual similarity. Knowledge-Based Systems. 143:1-9. https://doi.org/10.1016/j.knosys.2017.11.041 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1016/j.knosys.2017.11.041 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 9 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 143 es_ES
dc.relation.pasarela S\384153 es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES
dc.contributor.funder Deutscher Akademischer Austauschdienst es_ES
dc.contributor.funder Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg es_ES
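
The pipeline outlined in the abstract (project word embeddings into a shared space with a linear map learned from a small set of translation pairs, align words by vector similarity, then score the sentence pair) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the least-squares fit, the greedy one-to-one alignment, and the final score (sum of aligned cosine similarities divided by the longer sentence's length) are all assumptions chosen for illustration; the article itself investigates several unsupervised measures that this record does not specify.

```python
import numpy as np


def learn_projection(src_vecs, tgt_vecs):
    """Fit a linear map M such that src_vecs @ M approximates tgt_vecs.

    Row i of src_vecs and row i of tgt_vecs are the embeddings of one
    word translation pair (hypothetical input format).
    """
    M, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return M


def cosine_matrix(A, B):
    """Pairwise cosine similarities between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T


def greedy_align(sim):
    """Greedy one-to-one alignment: repeatedly take the most similar free pair."""
    sim = sim.copy()
    pairs = []
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        pairs.append((int(i), int(j), float(sim[i, j])))
        sim[i, :] = -np.inf  # source word i is now taken
        sim[:, j] = -np.inf  # target word j is now taken
    return pairs


def alignment_similarity(src_sent, tgt_sent, M):
    """Illustrative score: aligned similarities summed, normalised by the
    longer sentence so that unaligned words lower the score (an assumption)."""
    sim = cosine_matrix(src_sent @ M, tgt_sent)
    pairs = greedy_align(sim)
    return sum(s for _, _, s in pairs) / max(sim.shape)


# Toy data standing in for real monolingual embeddings.
rng = np.random.default_rng(0)
dim = 5
true_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
dictionary_src = rng.normal(size=(20, dim))
dictionary_tgt = dictionary_src @ true_map

M = learn_projection(dictionary_src, dictionary_tgt)
sentence_src = rng.normal(size=(3, dim))
sentence_tgt = (sentence_src @ true_map)[[2, 0, 1]]  # same words, reordered
unrelated_tgt = rng.normal(size=(4, dim))

print(alignment_similarity(sentence_src, sentence_tgt, M))
print(alignment_similarity(sentence_src, unrelated_tgt, M))
```

In this toy setup the target "language" is an exact linear transformation of the source space, so the reordered translation should score higher than the unrelated sentence. Real embeddings are only approximately related by a linear map, so actual scores would be noisier.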

