
A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia



dc.contributor.author Banerjee, Somnath es_ES
dc.contributor.author Kuila, Alapan es_ES
dc.contributor.author Roy, Aniruddha es_ES
dc.contributor.author Naskar, Sudip Kumar es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.contributor.author Bandyopadhyay, Sivaji es_ES
dc.date.accessioned 2016-06-23T15:08:38Z
dc.date.available 2016-06-23T15:08:38Z
dc.date.issued 2014-12-05
dc.identifier.isbn 978-1-4503-3755-7
dc.identifier.uri http://hdl.handle.net/10251/66381
dc.description © {Owner/Author | ACM} {Year}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824876 es_ES
dc.description.abstract [EN] In this paper, we describe a hybrid approach to word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, developed as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) in 2014. A CRF-based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Bangla words written in Roman script into native Bangla script. The system achieved an overall token-level language identification accuracy of 0.905. The token-level Bangla and English language identification F-scores are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper achieved the best scores across almost all metrics among all the participating systems for the Bangla-English language pair. es_ES
dc.description.sponsorship We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project “CLIA System Phase II”. The research work of the last author was carried out in the framework of WIQ-EI IRSES (Grant No. 269180) within the FP 7 Marie Curie, DIANA-APPLICATIONS (TIN2012-38603-C02-01) projects and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. es_ES
dc.format.extent 4 es_ES
dc.language English es_ES
dc.publisher ACM es_ES
dc.relation.ispartof FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation es_ES
dc.rights All rights reserved es_ES
dc.subject Code switch es_ES
dc.subject Transliteration es_ES
dc.subject Word-Level Language Identification es_ES
dc.subject.classification LANGUAGES AND COMPUTER SYSTEMS es_ES
dc.title A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics es_ES
dc.type Book chapter es_ES
dc.type Conference paper es_ES
dc.identifier.doi 10.1145/2824864.2824876
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2012-38603-C02-01/ES/DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/269180/EU/Web Information Quality Evaluation Initiative/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Banerjee, S.; Kuila, A.; Roy, A.; Naskar, SK.; Rosso, P.; Bandyopadhyay, S. (2014). A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics. In FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation. ACM. 170-173. https://doi.org/10.1145/2824864.2824876 es_ES
dc.description.accrualMethod S es_ES
dc.relation.conferencename 6th Forum for Information Retrieval Evaluation (FIRE 2014) es_ES
dc.relation.conferencedate December 5-7, 2014 es_ES
dc.relation.conferenceplace Bangalore, India es_ES
dc.relation.publisherversion http://dx.doi.org/10.1145/2824864.2824876 es_ES
dc.description.upvformatpinicio 170 es_ES
dc.description.upvformatpfin 173 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.relation.senia 283914 es_ES
dc.contributor.funder Department of Electronics and Information Technology, Ministry of Communications and Information Technology, India es_ES
dc.contributor.funder European Commission es_ES
dc.contributor.funder Universitat de València es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES

