Extracting Features from Textual Data in Class Imbalance Problems

Aravamuthan, Sarang; Jogalekar, Prasad; Lee, Jonghae

doi:10.4995/jclr.2022.18200

dc.contributor.author	Aravamuthan, Sarang	es_ES
dc.contributor.author	Jogalekar, Prasad	es_ES
dc.contributor.author	Lee, Jonghae	es_ES
dc.date.accessioned	2023-01-09T08:27:16Z
dc.date.available	2023-01-09T08:27:16Z
dc.date.issued	2022-11-23
dc.identifier.uri	http://hdl.handle.net/10251/191101
dc.description.abstract	[EN] We address class imbalance problems. These are classification problems where the target variable is binary, and one class dominates over the other. A central objective in these problems is to identify features that yield models with high precision/recall values, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures the efficacy of an n-gram in highlighting the minority class. The frequency counts of n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. According to the best practices followed by the services industry, many customer support tickets will get audited and tagged as contract-compliant whereas some will be tagged as over-delivered. Based on in-field data, we use a random forest classifier and perform a randomized grid search over the model hyperparameters. The model scoring is performed using an scoring function. Our objective is to minimize the follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the target precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of n-grams chosen randomly. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of dissimilarity of distributions and other (more general) measures that we formulate could potentially yield more effective models.	es_ES
dc.language	English	es_ES
dc.publisher	Universitat Politècnica de València	es_ES
dc.relation.ispartof	Journal of Computer-Assisted Linguistic Research	es_ES
dc.rights	Attribution - NonCommercial - NoDerivatives (by-nc-nd)	es_ES
dc.subject	Class imbalance	es_ES
dc.subject	Feature selection	es_ES
dc.subject	N-gram frequency	es_ES
dc.subject	NLP techniques	es_ES
dc.subject	Random forest classifier	es_ES
dc.title	Extracting Features from Textual Data in Class Imbalance Problems	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.4995/jclr.2022.18200
dc.rights.accessRights	Open	es_ES
dc.description.bibliographicCitation	Aravamuthan, S.; Jogalekar, P.; Lee, J. (2022). Extracting Features from Textual Data in Class Imbalance Problems. Journal of Computer-Assisted Linguistic Research. 6:42-58. https://doi.org/10.4995/jclr.2022.18200	es_ES
dc.description.accrualMethod	OJS	es_ES
dc.relation.publisherversion	https://doi.org/10.4995/jclr.2022.18200	es_ES
dc.description.upvformatpinicio	42	es_ES
dc.description.upvformatpfin	58	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	6	es_ES
dc.identifier.eissn	2530-9455
dc.relation.pasarela	OJS\18200	es_ES
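
The abstract describes selecting n-grams by a discrepancy score and using their frequency counts as features, but this record does not give the score's definition. The following is a minimal sketch under a stated assumption: the score used here (relative frequency of an n-gram in the minority class minus its relative frequency in the majority class) is a hypothetical stand-in, not the authors' formula. The function names and the toy ticket texts are likewise invented for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrepancy_scores(texts, labels, n=2):
    """ASSUMED score (not the paper's definition): relative frequency of
    an n-gram in the minority class (label 1) minus its relative
    frequency in the majority class (label 0)."""
    minority, majority = Counter(), Counter()
    for text, label in zip(texts, labels):
        (minority if label == 1 else majority).update(ngrams(text, n))
    min_total = sum(minority.values()) or 1
    maj_total = sum(majority.values()) or 1
    vocab = set(minority) | set(majority)
    return {g: minority[g] / min_total - majority[g] / maj_total for g in vocab}


def top_ngrams(scores, k):
    """Return the k highest-scoring n-grams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]


def count_features(texts, selected, n=2):
    """Build one row per text holding the counts of the selected n-grams."""
    rows = []
    for text in texts:
        counts = Counter(ngrams(text, n))
        rows.append([counts[g] for g in selected])
    return rows


# Toy data: label 1 marks the minority ("over-delivered") class.
texts = [
    "ticket closed within agreed scope",
    "issue resolved within agreed scope",
    "request handled within agreed scope",
    "extra work done beyond contract",
    "engineer did extra work on site",
]
labels = [0, 0, 0, 1, 1]

scores = discrepancy_scores(texts, labels, n=2)
selected = top_ngrams(scores, k=1)
features = count_features(texts, selected, n=2)
```

The resulting rows could then be passed to any classifier; the abstract mentions a random forest tuned by randomized search, which is left out here to keep the sketch dependency-free.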
