Does more data always yield better translations?

RiuNet: Institutional repository of the Polytechnic University of Valencia

dc.contributor.author Gascó Mora, Guillem es_ES
dc.contributor.author Rocha Sánchez, Martha Alicia es_ES
dc.contributor.author Sanchis Trilles, Germán es_ES
dc.contributor.author Andrés Ferrer, Jesús es_ES
dc.contributor.author Casacuberta Nolla, Francisco es_ES
dc.date.accessioned 2014-01-29T07:19:58Z
dc.date.available 2014-01-29T07:19:58Z
dc.date.issued 2012-04-23
dc.identifier.isbn 978-1-937284-19-0
dc.identifier.uri http://hdl.handle.net/10251/35214
dc.description.abstract Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results report not only significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions. es_ES
dc.language English es_ES
dc.publisher Association for Computational Linguistics es_ES
dc.relation European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755 es_ES
dc.relation Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) es_ES
dc.relation Spanish MEC/MICINN under iTrans2 (TIN2009-14511) project es_ES
dc.relation Spanish MITyC under the erudito.com (TSI-020110-2009-439) project es_ES
dc.relation Instituto Tecnológico de León es_ES
dc.relation DGEST-PROMEP es_ES
dc.relation CONACYT, México es_ES
dc.rights All rights reserved es_ES
dc.subject Bilingual corpora es_ES
dc.subject Training data selection techniques es_ES
dc.subject Probability of an in-domain corpus es_ES
dc.subject Infrequent n-gram occurrence es_ES
dc.subject.classification STATISTICS AND OPERATIONS RESEARCH es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title Does more data always yield better translations? es_ES
dc.type Conference paper es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/287755 es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214 es_ES
dc.description.accrualMethod Senia es_ES
dc.relation.conferencename 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) es_ES
dc.relation.conferencedate 2012-04-23 es_ES
dc.relation.conferenceplace Avignon, France es_ES
dc.relation.publisherversion http://www.aclweb.org/anthology/E12-1016 es_ES
dc.description.upvformatpinicio 152 es_ES
dc.description.upvformatpfin 161 es_ES
dc.relation.senia 234920

