dc.contributor.author | Aravamuthan, Sarang | es_ES |
dc.contributor.author | Jogalekar, Prasad | es_ES |
dc.contributor.author | Lee, Jonghae | es_ES |
dc.date.accessioned | 2023-01-09T08:27:16Z | |
dc.date.available | 2023-01-09T08:27:16Z | |
dc.date.issued | 2022-11-23 | |
dc.identifier.uri | http://hdl.handle.net/10251/191101 | |
dc.description.abstract | [EN] We address class imbalance problems. These are classification problems where the target variable is binary, and one class dominates over the other. A central objective in these problems is to identify features that yield models with high precision/recall values, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures the efficacy of an n-gram in highlighting the minority class. The frequency counts of n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. According to the best practices followed by the services industry, many customer support tickets will get audited and tagged as contract-compliant whereas some will be tagged as over-delivered. Based on in-field data, we use a random forest classifier and perform a randomized grid search over the model hyperparameters. The model scoring is performed using a scoring function. Our objective is to minimize the follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the target precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of n-grams chosen randomly. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of dissimilarity of distributions and other (more general) measures that we formulate could potentially yield more effective models. | es_ES |
dc.language | English | es_ES |
dc.publisher | Universitat Politècnica de València | es_ES |
dc.relation.ispartof | Journal of Computer-Assisted Linguistic Research | es_ES |
dc.rights | Attribution - NonCommercial - NoDerivatives (by-nc-nd) | es_ES |
dc.subject | Class imbalance | es_ES |
dc.subject | Feature selection | es_ES |
dc.subject | N-gram frequency | es_ES |
dc.subject | NLP techniques | es_ES |
dc.subject | Random forest classifier | es_ES |
dc.title | Extracting Features from Textual Data in Class Imbalance Problems | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.4995/jclr.2022.18200 | |
dc.rights.accessRights | Open | es_ES |
dc.description.bibliographicCitation | Aravamuthan, S.; Jogalekar, P.; Lee, J. (2022). Extracting Features from Textual Data in Class Imbalance Problems. Journal of Computer-Assisted Linguistic Research. 6:42-58. https://doi.org/10.4995/jclr.2022.18200 | es_ES |
dc.description.accrualMethod | OJS | es_ES |
dc.relation.publisherversion | https://doi.org/10.4995/jclr.2022.18200 | es_ES |
dc.description.upvformatpinicio | 42 | es_ES |
dc.description.upvformatpfin | 58 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 6 | es_ES |
dc.identifier.eissn | 2530-9455 | |
dc.relation.pasarela | OJS\18200 | es_ES |
dc.description.references | Batuwita, Rukshan, and Vasile Palade. 2010. "FSVM-CIL: Fuzzy Support Vector Machines for Class Imbalance Learning." IEEE Transactions on Fuzzy Systems 18: 558-571. https://doi.org/10.1109/TFUZZ.2010.2042721 | es_ES |
dc.description.references | Bi, Jingjun, and Chongsheng Zhang. 2018. "An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme." Knowledge-Based Systems 158: 81-93. https://doi.org/10.1016/j.knosys.2018.05.037 | es_ES |
dc.description.references | Brownlee, Jason. 2020. "Imbalanced Classification with Python: Better Metrics, Balance Skewed Classes, Cost-Sensitive Learning." Machine Learning Mastery. https://books.google.com/books/about/Imbalanced_Classification_with_Python.html?id=jaXJDwAAQBAJ | es_ES |
dc.description.references | Chawla, Nitesh V. 2009. "Data Mining for Imbalanced Datasets: An Overview." In Data Mining and Knowledge Discovery Handbook, edited by O. Maimon and L. Rokach, Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_45 | es_ES |
dc.description.references | He, Haibo, and Edwardo A. Garcia. 2009. "Learning from Imbalanced Data." IEEE Transactions on Knowledge and Data Engineering 21: 1263-1284. https://doi.org/10.1109/TKDE.2008.239 | es_ES |
dc.description.references | Ho, Tin K., and M. Basu. 2002. "Complexity measures of supervised classification problems." IEEE Transactions on Pattern Analysis and Machine Intelligence 24: 289-300. https://doi.org/10.1109/34.990132 | es_ES |
dc.description.references | Liu, Xu-Ying, Jianxin Wu, and Zhi-Hua Zhou. 2009. "Exploratory Undersampling for Class-Imbalance Learning." IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 39: 539-550. https://doi.org/10.1109/TSMCB.2008.2007853 | es_ES |
dc.description.references | Prati, Ronaldo C., Gustavo E.A.P.A. Batista and Maria C. Monard. 2004. "Class imbalances versus class overlapping: an analysis of a learning system behavior." 4th Mexican International Conference on Artificial Intelligence. LNCS, Mexico City, 2972: 312-321. https://doi.org/10.1007/978-3-540-24694-7_32 | es_ES |
dc.description.references | Rivera, Gilberto, Rogelio Florencia, Vicente García, Alejandro Ruiz, and J. Patricia Sánchez-Solís. 2020. "News Classification for Identifying Traffic Incident Points in a Spanish-Speaking Country: A Real-World Case Study of Class Imbalance Learning." Applied Sciences 10, 6253. https://doi.org/10.3390/app10186253 | es_ES |
dc.description.references | Santos, Miriam S, Jastin Pompeu Soares, Pedro Henriques Abreu, Hélder Araújo, and João Santos. 2018. "Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]." IEEE Computational Intelligence Magazine 13: 59-76. https://doi.org/10.1109/MCI.2018.2866730 | es_ES |
dc.description.references | Santos, Miriam S, Pedro Henriques Abreu, Nathalie Japkowicz, Alberto Fernández, and João Santos. 2023. "A unifying view of class overlap and imbalance: Key concepts, multi-view panorama, and open avenues for research." Information Fusion 89: 228-253. https://doi.org/10.1016/j.inffus.2022.08.017 | es_ES |
dc.description.references | Sarmanova, Akkenzhe, and Songül Albayrak. 2013. "Alleviating Class Imbalance Problem In Data Mining." 21st Signal Processing and Communications Applications Conference (SIU) 1-4. https://doi.org/10.1109/SIU.2013.6531574 | es_ES |
dc.description.references | Soda, Paolo. 2011. "A multi-objective optimisation approach for class imbalance learning." Pattern Recognition 44: 1801-1810. https://doi.org/10.1016/j.patcog.2011.01.015 | es_ES |
dc.description.references | Sotiropoulos, Dionysios, Christos Giannoulis, and George A. Tsihrintzis. 2014. "A comparative study of one-class classifiers in machine learning problems with extreme class imbalance." The 5th International Conference on Information, Intelligence, Systems and Applications 362-364. https://doi.org/10.1109/IISA.2014.6878723 | es_ES |
dc.description.references | Tahvili, Sahar, Leo Hatvani, Enislay Ramentol, Rita Pimentel, Wasif Afzal, and Francisco Herrera. 2020. "A novel methodology to classify test cases using natural language processing and imbalanced learning." Engineering Applications of Artificial Intelligence, 95, 103878. https://doi.org/10.1016/j.engappai.2020.103878 | es_ES |
dc.description.references | Wang, Shuo, Leandro L. Minku, and Xin Yao. 2015. "Resampling-Based Ensemble Methods for Online Class Imbalance Learning." IEEE Transactions on Knowledge and Data Engineering 27: 1356-1368. https://doi.org/10.1109/TKDE.2014.2345380 | es_ES |
dc.description.references | Wang, Shuo, Leandro L. Minku, and Xin Yao. 2018. "A Systematic Study of Online Class Imbalance Learning With Concept Drift." IEEE Transactions on Neural Networks and Learning Systems 29: 4802-4821. https://doi.org/10.1109/TNNLS.2017.2771290 | es_ES |
dc.description.references | Wang, Shuo, and Xin Yao. 2013. "Using Class Imbalance Learning for Software Defect Prediction." IEEE Transactions on Reliability 62: 434-443. https://doi.org/10.1109/TR.2013.2259203 | es_ES |
dc.description.references | Zhang, Chongsheng, Jingjun Bi, Shixin Xu, Enislay Ramentol, Gaojuan Fan, Baojun Qiao, and Hamido Fujita. 2019. "Multi-Imbalance: An open-source software for multi-class imbalance learning." Knowledge-Based Systems 174: 137-143. https://doi.org/10.1016/j.knosys.2019.03.001 | es_ES |