- -

A Low Dimensionality Representation for Language Variety Identification

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Compartir/Enviar a

Citas

Estadísticas

  • Estadisticas de Uso

A Low Dimensionality Representation for Language Variety Identification

Mostrar el registro sencillo del ítem

Ficheros en el ítem

dc.contributor.author Rangel-Pardo, Francisco Manuel es_ES
dc.contributor.author Franco-Salvador, Marc es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.date.accessioned 2020-06-12T03:34:04Z
dc.date.available 2020-06-12T03:34:04Z
dc.date.issued 2018 es_ES
dc.identifier.issn 0302-9743 es_ES
dc.identifier.uri http://hdl.handle.net/10251/146184
dc.description.abstract [EN] Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality¿and increasing the big data suitability¿to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages. es_ES
dc.description.sponsorship The work of the first author was in the framework of ECOPORTUNITY IPT-2012-1220-430000. The work of the last two authors was in the framework of the SomEMBED MINECO TIN2015-71147-C2-1-P research project. This work has been also supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAPATER (PrometeoII/2014/030). es_ES
dc.language Inglés es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Lecture Notes in Computer Science es_ES
dc.relation.ispartof Computational Linguistics and Intelligent Text Processing es_ES
dc.rights Reserva de todos los derechos es_ES
dc.subject Low dimensionality representation es_ES
dc.subject Language variety identification es_ES
dc.subject Similar languages discrimination es_ES
dc.subject Author profiling es_ES
dc.subject Big data es_ES
dc.subject Social media es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title A Low Dimensionality Representation for Language Variety Identification es_ES
dc.type Artículo es_ES
dc.type Comunicación en congreso es_ES
dc.type Capítulo de libro es_ES
dc.identifier.doi 10.1007/978-3-319-75487-1_13 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2014%2F030/ES/ Adaptive learning and multimodality in machine translation and text transcription/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2015-71147-C2-1-P/ES/COMPRENSION DEL LENGUAJE EN LOS MEDIOS DE COMUNICACION SOCIAL - REPRESENTANDO CONTEXTOS DE FORMA CONTINUA/ es_ES
dc.rights.accessRights Abierto es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Rangel-Pardo, FM.; Franco-Salvador, M.; Rosso, P. (2018). A Low Dimensionality Representation for Language Variety Identification. Lecture Notes in Computer Science. 9624:156-169. https://doi.org/10.1007/978-3-319-75487-1_13 es_ES
dc.description.accrualMethod S es_ES
dc.relation.conferencename 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016) es_ES
dc.relation.conferencedate Abril 03-09,2016 es_ES
dc.relation.conferenceplace Konya, Turquía es_ES
dc.relation.publisherversion https://doi.org/10.1007/978-3-319-75487-1_13 es_ES
dc.description.upvformatpinicio 156 es_ES
dc.description.upvformatpfin 169 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 9624 es_ES
dc.relation.pasarela S\384295 es_ES
dc.contributor.funder Ministerio de Economía y Empresa es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.description.references Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., Antònia Martít, M.: Language variety identification using distributed representations of words and documents. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 28–40. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24027-5_3 es_ES
dc.description.references Goodman, J.: Classes for fast maximum entropy training. In: Proceedings of the Acoustics, Speech, and Signal Processing (ICASSP 2001), vol. 1, pp. 561–564 (2001) es_ES
dc.description.references Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 307–361 (2012) es_ES
dc.description.references Hinton, G.E., Mcclelland, J.L., Rumelhart, D.E.: Distributed Representations, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Foundations, vol. 1. MIT Press, Cambridge (1986) es_ES
dc.description.references Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), vol. 32 (2014) es_ES
dc.description.references Maier, W., Gómez-Rodríguez, C.: Language variety identification in Spanish tweets. In: Workshop on Language Technology for Closely Related Languages and Language Variants (EMNLP 2014), pp. 25–35 (2014) es_ES
dc.description.references Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at International Conference on Learning Representations (ICLR 2013) (2013) es_ES
dc.description.references Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) es_ES
dc.description.references Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1751–1758 (2012) es_ES
dc.description.references Sadat, F., Kazemi, F., Farzindar, A.: Automatic identification of Arabic language varieties and dialects in social media. In: 1st International Workshop on Social Media Retrieval and Analysis (SoMeRa 2014) (2014) es_ES
dc.description.references Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988) es_ES
dc.description.references Tan, L., Zampieri, M., Ljubešic, N., Tiedemann, J.: Merging comparable data sources for the discrimination of similar languages: the DSL corpus collection. In: 7th Workshop on Building and Using Comparable Corpora Building Resources for Machine Translation Research (BUCC 2014), pp. 6–10 (2014) es_ES
dc.description.references Zampieri, M., Gebrekidan-Gebre, B.: Automatic identification of language varieties: the case of Portuguese. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012), pp. 233–237 (2012) es_ES
dc.description.references Zampieri, M., Tan, L., Ljubeši, N., Tiedemann, J.: A report on the DSL shared task 2014. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial 2014), pp. 58–67 (2014) es_ES


Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem