dc.contributor.author | Gascó Mora, Guillem | es_ES |
dc.contributor.author | Rocha Sánchez, Martha Alicia | es_ES |
dc.contributor.author | Sanchis Trilles, Germán | es_ES |
dc.contributor.author | Andrés Ferrer, Jesús | es_ES |
dc.contributor.author | Casacuberta Nolla, Francisco | es_ES |
dc.date.accessioned | 2014-01-29T07:19:58Z | |
dc.date.available | 2014-01-29T07:19:58Z | |
dc.date.issued | 2012-04-23 | |
dc.identifier.isbn | 978-1-937284-19-0 | |
dc.identifier.uri | http://hdl.handle.net/10251/35214 | |
dc.description.abstract | Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus; and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions. | es_ES |
dc.description.sponsorship | The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and by Instituto Tecnológico de León, DGEST-PROMEP and CONACYT, Mexico. | |
dc.language | English | es_ES |
dc.publisher | Association for Computational Linguistics | es_ES |
dc.rights | All rights reserved | es_ES |
dc.subject | Bilingual corpora | es_ES |
dc.subject | Training data selection techniques | es_ES |
dc.subject | Probability of an in-domain corpus | es_ES |
dc.subject | Infrequent n-gram occurrence | es_ES |
dc.subject.classification | Statistics and Operations Research | es_ES |
dc.subject.classification | Computer Languages and Systems | es_ES |
dc.title | Does more data always yield better translations? | es_ES |
dc.type | Conference paper | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/EC/FP7/287755/EU/Transcription and Translation of Video Lectures/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MEC//CSD2007-00018/ES/Multimodal Interaction in Pattern Recognition and Computer Vision/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MICINN//TIN2009-14511/ES/Traduccion De Textos Y Transcripcion De Voz Interactivas/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MITURCO//TSI-020110-2009-0439/ES/ERUDITO.COM/ | es_ES |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.conferencename | 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) | es_ES |
dc.relation.conferencedate | 2012-04-23 | es_ES |
dc.relation.conferenceplace | Avignon, France | es_ES |
dc.relation.publisherversion | http://www.aclweb.org/anthology/E12-1016 | es_ES |
dc.description.upvformatpinicio | 152 | es_ES |
dc.description.upvformatpfin | 161 | es_ES |
dc.relation.senia | 234920 | |
dc.contributor.funder | European Commission | |
dc.contributor.funder | Ministerio de Ciencia e Innovación | |
dc.contributor.funder | Instituto Tecnológico de León, México | |
dc.contributor.funder | Dirección General de Educación Superior Tecnológica, México | |
dc.contributor.funder | Consejo Nacional de Ciencia y Tecnología, México | |
dc.contributor.funder | Ministerio de Educación y Ciencia | es_ES |
dc.contributor.funder | Ministerio de Industria, Turismo y Comercio | es_ES |