Language identification of multilingual posts from Twitter: a case study

Pla Santamaría, Ferran; Hurtado Oliver, Lluis Felip

doi:10.1007/s10115-016-0997-x

Identificarse

Buscar en RiuNet

Listar

Todo RiuNet
Esta colección

Mi cuenta

Acceder

Estadísticas

Ver Estadísticas de uso

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Language identification of multilingual posts from Twitter: a case study

Mostrar el registro sencillo del ítem

Ficheros en el ítem

Nombre: kais_fpla_2016_fi ...

Tamaño: 1.400Mb

Formato: PDF

Descripción: Versión del Autor.

Abrir

Nombre: Pla;Hurtado - ...

Tamaño: 930.9Kb

Formato: PDF

Descripción: Versión editorial

Solicitar una copia al autor

dc.contributor.author	Pla Santamaría, Ferran	es_ES
dc.contributor.author	Hurtado Oliver, Lluis Felip	es_ES
dc.date.accessioned	2017-05-17T08:45:51Z
dc.date.available	2017-05-17T08:45:51Z
dc.date.issued	2016-09-29
dc.identifier.issn	0219-1377
dc.identifier.uri	http://hdl.handle.net/10251/81256
dc.description	The final publication is available at Springer via http://dx.doi.org/10.1007/s10115-016-0997-x	es_ES
dc.description.abstract	This paper describes a method for handling multi-class and multi-label classification problems based on the support vector machine formalism. This method has been applied to the language identification problem in Twitter. The system evaluation was performed mainly on a Twitter data set developed in the TweetLID workshop. This data set contains bilingual tweets written in the most commonly used Iberian languages (i.e., Spanish, Portuguese, Catalan, Basque, and Galician) as well as the English language. We address the following problems: (1) social media texts. We propose a suitable tokenization that processes the peculiarities of Twitter; (2) multilingual tweets. Since a tweet can belong to more than one language, we need to use a multi-class and multi-label classifier; (3) similar languages. We study the main confusions among similar languages; and (4) unbalanced classes. We propose threshold-based strategy to favor classes with less data. We have also studied the use of Wikipedia and the addition of new tweets in order to increase the training data set. Additionally, we have tested our system on Bergsma corpus, a collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari alphabets. To our knowledge, we obtained the best results published on the TweetLID data set and results that are in line with the best results published on Bergsma data set.	es_ES
dc.description.sponsorship	This work has been partially funded by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MINECO TIN2014-54288-C4-3-R).	en_EN
dc.language	Inglés	es_ES
dc.publisher	Springer Verlag (Germany)	es_ES
dc.relation.ispartof	Knowledge and Information Systems	es_ES
dc.rights	Reserva de todos los derechos	es_ES
dc.subject	Natural language processing	es_ES
dc.subject	Language identification	es_ES
dc.subject	Multi-label classification	es_ES
dc.subject	Support vector machines	es_ES
dc.subject	Twitter	es_ES
dc.subject.classification	LENGUAJES Y SISTEMAS INFORMATICOS	es_ES
dc.title	Language identification of multilingual posts from Twitter: a case study	es_ES
dc.type	Artículo	es_ES
dc.identifier.doi	10.1007/s10115-016-0997-x
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO//TIN2014-54288-C4-3-R/ES/PROCESADO DE AUDIO, HABLA Y LENGUAJE PARA ANALISIS DE INFORMACION MULTIMEDIA/	es_ES
dc.rights.accessRights	Abierto	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica	es_ES
dc.description.bibliographicCitation	Pla Santamaría, F.; Hurtado Oliver, LF. (2016). Language identification of multilingual posts from Twitter: a case study. Knowledge and Information Systems. 51(3):965-989. https://doi.org/10.1007/s10115-016-0997-x	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	http://dx.doi.org/10.1007/s10115-016-0997-x	es_ES
dc.description.upvformatpinicio	965	es_ES
dc.description.upvformatpfin	989	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	51	es_ES
dc.description.issue	3	es_ES
dc.relation.senia	325934	es_ES
dc.contributor.funder	Ministerio de Economía y Competitividad	es_ES
dc.description.references	Baldwin T, Lui M (2010) Language identification: the long and the short of the matter. In: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, HLT ‘10. Association for Computational Linguistics, Stroudsburg, PA, pp 229–237	es_ES
dc.description.references	Bergsma S, McNamee P, Bagdouri M, Fink C, Wilson T (2012) Language identification for creating language-specific twitter collections. In: Proceedings of the second workshop on language in social media, LSM ‘12. Association for Computational Linguistics, Stroudsburg, PA, pp 65–74	es_ES
dc.description.references	Carter S, Weerkamp W, Tsagkias M (2013) Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Lang Resour Eval 47(1):195–215	es_ES
dc.description.references	Cavnar WB, Trenkle JM (1994) N-gram-based text categorization. In: Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, pp. 161–175	es_ES
dc.description.references	Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297	es_ES
dc.description.references	Gamallo P, García M, Sotelo S, Campos JRP (2014) Comparing ranking-based and naive bayes approaches to language detection on tweets. ‘TweetLID@SEPLN’, pp 12–16	es_ES
dc.description.references	Goldszmidt M, Najork M, Paparizos S (2013) Boot-strapping language identifiers for short colloquial postings. In: Proceeding of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD 2013). Springer	es_ES
dc.description.references	Grefenstette G (1995) Comparing two language identification schemes. In: 3rd international conference on statistical analysis of textural data	es_ES
dc.description.references	Hurtado LF, Pla F, Giménez M, Arnal ES (2014) Elirf-upv en tweetlid: Identificación del idioma en twitter, In: Proceedings of the Tweet language identification workshop co-located with 30th conference of the Spanish society for natural language processing, TweetLID@SEPLN 2014, Girona, 16 Sept 2014, pp 35–38	es_ES
dc.description.references	Jauhiainen T, Lindén K, Jauhiainen H (2015) Language set identification in noisy synthetic multilingual documents. In: Gelbukh A (ed) Computational linguistics and intelligent text processing, vol 9041 of lecture notes in computer science. Springer International Publishing, pp 633–643	es_ES
dc.description.references	Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Proceedings of ECML-98, 10th European conference on machine learning, no. 1398. Springer, Heidelberg, pp 137–142	es_ES
dc.description.references	Liu B (2012) Sentiment analysis and opinion mining. A comprehensive introduction and survey. Morgan & Claypool Publishers, San Rafael	es_ES
dc.description.references	Ljubešić N, Mikelić N, Boras D (2007) Language identification: How to distinguish similar languages, In: Lužar-Stifter V, Hljuz Dobrić V (eds), Proceedings of the 29th international conference on information technology interfaces. SRCE University Computing Centre, Zagreb, pp 541–546	es_ES
dc.description.references	Lui M, Baldwin T (2014) Accurate language identification of twitter messages. In: Proceedings of the EACL 2014 workshop on language analysis in social media (LASM 2014), pp 17–25	es_ES
dc.description.references	Lui M, Lau JH, Baldwin T (2014) Automatic detection and language identification of multilingual documents. Trans Assoc Comput Linguist 2:27–40	es_ES
dc.description.references	Nguyen D, Dogruoz AS (2014) Word level language identification in online multilingual communication. In: Proceedings of the 2013 conference on empirical methods in natural language processing	es_ES
dc.description.references	O’Connor B, Krieger M, Ahn D (2010) Tweetmotif: exploratory search and topic summarization for twitter. In: Cohen WW, Gosling S (eds) Proceedings of the fourth international conference on weblogs and social media, ICWSM 2010, Washington, DC. The AAAI Press, 23–26 May 2010	es_ES
dc.description.references	Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830	es_ES
dc.description.references	Pla F, Hurtado L-F (2014) Political tendency identification in twitter using sentiment analysis techniques. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, pp 183–192	es_ES
dc.description.references	Prager JM (1999) Linguini: language identification for multilingual documents. J Manage Inf Syst 16(3):71–101	es_ES
dc.description.references	Ramón Quevedo J, Luaces O, Bahamonde A (2012) Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recogn 45(2):876–883	es_ES
dc.description.references	Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in twitter. In: Proceedings of the 2nd international workshop on search and mining user-generated contents, SMUC ‘10. ACM, New York, NY, pp 37–44	es_ES
dc.description.references	Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47	es_ES
dc.description.references	Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Warehous Min 2007:1–13	es_ES
dc.description.references	Zubiaga A, Vicente IS, Gamallo P, Campos JRP, Loinaz IA, Aranberri N, Ezeiza A Fresno-Fernández V (2014) Overview of tweetlid: Tweet language identification at SEPLN 2014. In: Proceedings of the Tweet language identification workshop co-located with 30th conference of the Spanish society for natural language processing. TweetLID@SEPLN 2014, Girona, Spain, 16 Sept 2014, pp 1–11	es_ES
dc.description.references	Zubiaga A, San Vicente I, Gamallo P, Pichel JR, Alegria I, Aranberri N, Ezeiza A, Fresno V (2015) TweetLID: a benchmark for tweet language identification. J Lang Res Eval. Springer, pp 1–38. doi: 10.1007/s10579-015-9317-4	es_ES

Este ítem aparece en la(s) siguiente(s) colección(ones)

Artículos, conferencias, monografías [48357]

Mostrar el registro sencillo del ítem

Language identification of multilingual posts from Twitter: a case study

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Buscar en RiuNet

Listar

Todo RiuNet

Esta colección

Mi cuenta

Estadísticas

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Language identification of multilingual posts from Twitter: a case study

Ficheros en el ítem

Este ítem aparece en la(s) siguiente(s) colección(ones)