Does more data always yield better translations?

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Gascó Mora, Guillem es_ES
dc.contributor.author Rocha Sánchez, Martha Alicia es_ES
dc.contributor.author Sanchis Trilles, Germán es_ES
dc.contributor.author Andrés Ferrer, Jesús es_ES
dc.contributor.author Casacuberta Nolla, Francisco es_ES
dc.date.accessioned 2014-01-29T07:19:58Z
dc.date.available 2014-01-29T07:19:58Z
dc.date.issued 2012-04-23
dc.identifier.isbn 978-1-937284-19-0
dc.identifier.uri http://hdl.handle.net/10251/35214
dc.description.abstract Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus; and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions. es_ES
dc.description.sponsorship The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” program (CSD2007-00018), and iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and Instituto Tecnológico de León, DGEST-PROMEP and CONACYT, Mexico.
dc.language English es_ES
dc.publisher Association for Computational Linguistics es_ES
dc.rights All rights reserved es_ES
dc.subject Bilingual corpora es_ES
dc.subject Training data selection techniques es_ES
dc.subject Probability of an in-domain corpus es_ES
dc.subject Infrequent n-gram occurrence es_ES
dc.subject.classification Statistics and Operations Research es_ES
dc.subject.classification Computer Languages and Systems es_ES
dc.title Does more data always yield better translations? es_ES
dc.type Conference paper es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/287755/EU/Transcription and Translation of Video Lectures/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MEC//CSD2007-00018/ES/Multimodal Interaction in Pattern Recognition and Computer Vision/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MICINN//TIN2009-14511/ES/Traduccion De Textos Y Transcripcion De Voz Interactivas/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MITURCO//TSI-020110-2009-0439/ES/ERUDITO.COM/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214 es_ES
dc.description.accrualMethod S es_ES
dc.relation.conferencename 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) es_ES
dc.relation.conferencedate 2012-04-23 es_ES
dc.relation.conferenceplace Avignon, France es_ES
dc.relation.publisherversion http://www.aclweb.org/anthology/E12-1016 es_ES
dc.description.upvformatpinicio 152 es_ES
dc.description.upvformatpfin 161 es_ES
dc.relation.senia 234920
dc.contributor.funder European Commission
dc.contributor.funder Ministerio de Ciencia e Innovación
dc.contributor.funder Instituto Tecnológico de León, México
dc.contributor.funder Dirección General de Educación Superior Tecnológica, México
dc.contributor.funder Consejo Nacional de Ciencia y Tecnología, México
dc.contributor.funder Ministerio de Educación y Ciencia es_ES
dc.contributor.funder Ministerio de Industria, Turismo y Comercio es_ES

