Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214
Please use this identifier to cite or link to this item: http://hdl.handle.net/10251/35214
Title:
|
Does more data always yield better translations?
|
Author:
|
Gascó Mora, Guillem
Rocha Sánchez, Martha Alicia
Sanchis Trilles, Germán
Andrés Ferrer, Jesús
Casacuberta Nolla, Francisco
|
UPV Unit:
|
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
|
Issued date:
|
2012
|
Abstract:
|
Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus; and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.
|
Subjects:
|
Bilingual corpora
Training data selection techniques
Probability of an in-domain corpus
Infrequent n-gram occurrence
|
Copyright:
|
All rights reserved
|
ISBN:
|
978-1-937284-19-0
|
Publisher:
|
Association for Computational Linguistics
|
Publisher version:
|
http://www.aclweb.org/anthology/E12-1016
|
Conference name:
|
13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012)
|
Conference place:
|
Avignon, France
|
Conference date:
|
2012-04-23
|
Project ID:
|
info:eu-repo/grantAgreement/EC/FP7/287755
MICINN/CSD2007-00018
MICINN/TIN2009-14511
MITYC/TSI-020110-2009-439
|
Thanks:
|
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. It was also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and by Instituto Tecnológico de León, DGEST-PROMEP and CONACYT, Mexico.
|
Type:
|
Conference paper
|