Does more data always yield better translations?

Gascó Mora, Guillem; Rocha Sánchez, Martha Alicia; Sanchis Trilles, Germán; Andrés Ferrer, Jesús; Casacuberta Nolla, Francisco

Files in this item

Name: E12-1016.pdf

Size: 191.5 KB

Format: PDF

Description: Publisher's version

dc.contributor.author	Gascó Mora, Guillem	es_ES
dc.contributor.author	Rocha Sánchez, Martha Alicia	es_ES
dc.contributor.author	Sanchis Trilles, Germán	es_ES
dc.contributor.author	Andrés Ferrer, Jesús	es_ES
dc.contributor.author	Casacuberta Nolla, Francisco	es_ES
dc.date.accessioned	2014-01-29T07:19:58Z
dc.date.available	2014-01-29T07:19:58Z
dc.date.issued	2012-04-23
dc.identifier.isbn	978-1-937284-19-0
dc.identifier.uri	http://hdl.handle.net/10251/35214
dc.description.abstract	Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is unclear whether all of the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results show not only significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.	es_ES
dc.description.sponsorship	The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and by Instituto Tecnológico de León, DGEST-PROMEP and CONACYT, Mexico.
dc.language	English	es_ES
dc.publisher	Association for Computational Linguistics	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Bilingual corpora	es_ES
dc.subject	Training data selection techniques	es_ES
dc.subject	Probability of an in-domain corpus	es_ES
dc.subject	Infrequent n-gram occurrence	es_ES
dc.subject.classification	STATISTICS AND OPERATIONS RESEARCH	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	Does more data always yield better translations?	es_ES
dc.type	Conference paper	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/FP7/287755/EU/Transcription and Translation of Video Lectures/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MEC//CSD2007-00018/ES/Multimodal Interaction in Pattern Recognition and Computer Vision/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MICINN//TIN2009-14511/ES/Traduccion De Textos Y Transcripcion De Voz Interactivas/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MITURCO//TSI-020110-2009-0439/ES/ERUDITO.COM/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations?. Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.conferencename	13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012)	es_ES
dc.relation.conferencedate	2012-04-23	es_ES
dc.relation.conferenceplace	Avignon, France	es_ES
dc.relation.publisherversion	http://www.aclweb.org/anthology/E12-1016	es_ES
dc.description.upvformatpinicio	152	es_ES
dc.description.upvformatpfin	161	es_ES
dc.relation.senia	234920
dc.contributor.funder	European Commission
dc.contributor.funder	Ministerio de Ciencia e Innovación
dc.contributor.funder	Instituto Tecnológico de León, México
dc.contributor.funder	Dirección General de Educación Superior Tecnológica, México
dc.contributor.funder	Consejo Nacional de Ciencia y Tecnología, México
dc.contributor.funder	Ministerio de Educación y Ciencia	es_ES
dc.contributor.funder	Ministerio de Industria, Turismo y Comercio	es_ES
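
The abstract names a selection technique based on infrequent n-gram occurrence but this record gives no algorithmic detail. The following is only a minimal sketch of one way such a selection could work, not the authors' implementation. The function names (`ngrams`, `select_infrequent`), the `threshold` and `max_n` parameters, the whitespace tokenisation, and the greedy scoring rule are all assumptions made for illustration: a pool sentence is scored by how many n-grams of the text to be translated it contains that the current selection has covered fewer than `threshold` times.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every n-gram (as a tuple) of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_infrequent(pool, test_sentences, threshold=1, max_n=3):
    """Greedily pick pool sentences that cover still-infrequent test n-grams.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    each needed n-gram contributes max(0, threshold - times_covered_so_far).
    """
    # n-grams that occur in the text we want to translate
    needed = {g for s in test_sentences for g in ngrams(s.split(), max_n)}
    counts = Counter()  # number of selected sentences containing each needed n-gram
    selected = []
    remaining = list(pool)

    def score(sentence):
        grams = set(ngrams(sentence.split(), max_n)) & needed
        return sum(max(0, threshold - counts[g]) for g in grams)

    while remaining:
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # no remaining sentence adds an infrequent n-gram
        selected.append(best)
        remaining.remove(best)
        # count each needed n-gram once per selected sentence (a simplification)
        counts.update(set(ngrams(best.split(), max_n)) & needed)
    return selected


pool = ["the cat sat", "a dog barked", "the cat"]
print(select_infrequent(pool, ["the cat sat down"]))
```

With this toy pool, the second and third sentences are skipped because, once the first sentence is selected, they contribute no n-gram that is still below the threshold. This mirrors the abstract's observation that a small fraction of the data may suffice, although the actual criterion used in the paper may differ.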

RiuNet: Institutional Repository of the Universitat Politècnica de València
