Vector sentences representation for data selection in statistical machine translation

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

dc.contributor.author Chinea-Rios, Mara es_ES
dc.contributor.author Sanchis Trilles, Germán es_ES
dc.contributor.author Casacuberta Nolla, Francisco es_ES
dc.date.accessioned 2020-11-20T04:31:36Z
dc.date.available 2020-11-20T04:31:36Z
dc.date.issued 2019-07 es_ES
dc.identifier.issn 0885-2308 es_ES
dc.identifier.uri http://hdl.handle.net/10251/155404
dc.description.abstract [EN] One of the most popular approaches to machine translation consists in formulating the problem as a pattern recognition approach. Under this perspective, bilingual corpora are precious resources, as they allow for a proper estimation of the underlying models. In this framework, selecting the best possible corpus is critical, and data selection aims to find the best subset of the bilingual sentences from an available pool of sentences such that the final translation quality is improved. In this paper, we present a new data selection technique that leverages a continuous vector-space representation of sentences. Experimental results report improvements compared not only with a system trained only with in-domain data, but also compared with a system trained on all the available data. Finally, we compared our proposal with other state-of-the-art data selection techniques (Cross-entropy selection and Infrequent ngrams recovery) in two different scenarios, obtaining very promising results with our proposal: our data selection strategy is able to yield results that are at least as good as the best-performing strategy for each scenario. The empirical results reported are coherent across different language pairs. es_ES
dc.description.sponsorship Work supported by the Generalitat Valenciana under grant ALMAMATER (PrometeoII/2014/030) and the FPI (2014) grant by Universitat Politècnica de València. es_ES
dc.language English es_ES
dc.publisher Elsevier es_ES
dc.relation.ispartof Computer Speech & Language es_ES
dc.rights Attribution - NonCommercial - NoDerivatives (by-nc-nd) es_ES
dc.subject Statistical machine translation es_ES
dc.subject Data selection es_ES
dc.subject Continuous vector-space representation es_ES
dc.subject Cross-entropy es_ES
dc.subject Infrequent ngrams recovery es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title Vector sentences representation for data selection in statistical machine translation es_ES
dc.type Article es_ES
dc.identifier.doi 10.1016/j.csl.2018.12.005 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2014%2F030/ES/ Adaptive learning and multimodality in machine translation and text transcription/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/UPV//FPI-2014 es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Chinea-Rios, M.; Sanchis Trilles, G.; Casacuberta Nolla, F. (2019). Vector sentences representation for data selection in statistical machine translation. Computer Speech & Language. 56:1-16. https://doi.org/10.1016/j.csl.2018.12.005 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1016/j.csl.2018.12.005 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 16 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 56 es_ES
dc.relation.pasarela S\403697 es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.contributor.funder Universitat Politècnica de València es_ES

