Does more data always yield better translations?

Gascó Mora, Guillem; Rocha Sánchez, Martha Alicia; Sanchis Trilles, Germán; Andrés Ferrer, Jesús; Casacuberta Nolla, Francisco

Files

E12-1016.pdf (191.51 KB)

Date

2012-04-23

Authors

Gascó Mora, Guillem

Rocha Sánchez, Martha Alicia

Sanchis Trilles, Germán

Andrés Ferrer, Jesús

Casacuberta Nolla, Francisco

Organizational units

Centro de Investigación Pattern Recognition and Human Language Technology

Handle

https://riunet.upv.es/handle/10251/35214

Bibliographic citation

Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. https://riunet.upv.es/handle/10251/35214

Abstract

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.

Keywords

Bilingual corpora, Training data selection techniques, Probability of an in-domain corpus, Infrequent n-gram occurrence

Publisher's version

http://www.aclweb.org/anthology/E12-1016

Collections

Articles, conferences, monographs
OpenAIRE (Open Access Infrastructure for Research in Europe)
