Fine-Grained Analysis of Language Varieties and Demographics

Rangel, Francisco; Rosso, Paolo; Zaghouani, Wajdi; Charfi, Anis

doi:10.1017/S1351324920000108

dc.contributor.author	Rangel, Francisco	es_ES
dc.contributor.author	Rosso, Paolo	es_ES
dc.contributor.author	Zaghouani, Wajdi	es_ES
dc.contributor.author	Charfi, Anis	es_ES
dc.date.accessioned	2021-05-27T03:34:35Z
dc.date.available	2021-05-27T03:34:35Z
dc.date.issued	2020-11	es_ES
dc.identifier.issn	1351-3249	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/166834
dc.description.abstract	[EN] The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authors' demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authors' gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors' age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.	es_ES
dc.description.sponsorship	This publication was made possible by NPRP grant 9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.	es_ES
dc.language	English	es_ES
dc.publisher	Cambridge University Press	es_ES
dc.relation.ispartof	Natural Language Engineering	es_ES
dc.rights	Attribution - NonCommercial - NoDerivatives (by-nc-nd)	es_ES
dc.subject	Language variety identification	es_ES
dc.subject	Demographics	es_ES
dc.subject	Gender	es_ES
dc.subject	Age	es_ES
dc.subject	Author profiling	es_ES
dc.subject	Cybersecurity	es_ES
dc.subject	Arabic	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	Fine-Grained Analysis of Language Varieties and Demographics	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1017/S1351324920000108	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/QNRF//NPRP 9-175-1-033/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Rangel, F.; Rosso, P.; Zaghouani, W.; Charfi, A. (2020). Fine-Grained Analysis of Language Varieties and Demographics. Natural Language Engineering. 26(6):641-661. https://doi.org/10.1017/S1351324920000108	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	https://doi.org/10.1017/S1351324920000108	es_ES
dc.description.upvformatpinicio	641	es_ES
dc.description.upvformatpfin	661	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	26	es_ES
dc.description.issue	6	es_ES
dc.relation.pasarela	S\433808	es_ES
dc.contributor.funder	Carnegie Mellon University	es_ES
dc.contributor.funder	Qatar National Research Fund	es_ES

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia
