
Can language models automate data wrangling?

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia



dc.contributor.author Jaimovitch-López, Gonzalo es_ES
dc.contributor.author Ferri Ramírez, César es_ES
dc.contributor.author Hernández-Orallo, José es_ES
dc.contributor.author Martínez-Plumed, Fernando es_ES
dc.contributor.author Ramírez Quintana, María José es_ES
dc.date.accessioned 2023-08-28T18:00:31Z
dc.date.available 2023-08-28T18:00:31Z
dc.date.issued 2023-06 es_ES
dc.identifier.issn 0885-6125 es_ES
dc.identifier.uri http://hdl.handle.net/10251/195751
dc.description.abstract [EN] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is therefore an important research question whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on the results, and how the models compare with specialised data wrangling systems and other tools. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome. es_ES
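The few-shot regime described in the abstract can be sketched as follows. This is a minimal illustration of how a data wrangling task (here, hypothetical date-normalisation pairs) might be serialised into a completion prompt for a GPT-style model; the prompt layout, labels and example pairs are assumptions, not the paper's actual prompts or settings.

```python
def build_fewshot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Serialise input-output example pairs into a few-shot completion prompt.

    The returned string ends with an unanswered output label, so a language
    model asked to continue the text produces the wrangled value for `query`.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{input_label}: {src}")
        lines.append(f"{output_label}: {tgt}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the model completes from here
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical date-format examples (DD/MM/YYYY -> ISO 8601).
    shots = [
        ("25/03/1983", "1983-03-25"),
        ("07/12/1990", "1990-12-07"),
    ]
    print(build_fewshot_prompt(shots, "12/06/2021"))
```

In a pipeline, the resulting string would be sent to a completion endpoint and the first line of the continuation taken as the transformed value; the number of example pairs (the "shots") is the knob compared in the paper's few-shot experiments.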
dc.description.sponsorship Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, the MIT-Spain - INDITEX Sustainability Seed Fund under project COST-OMIZE, the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32 and PID2021-122830OB-C42, Generalitat Valenciana under PROMETEO/2019/098 and INNEST/2021/317, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR) and US DARPA HR00112120007 ReCOG-AI. Acknowledgements: We thank Lidia Contreras for her help with the Data Wrangling Dataset Repository. We thank the anonymous reviewers from the ECMLPKDD Workshop on Automating Data Science (ADS2021) and the anonymous reviewers of this special issue for their comments. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Machine Learning es_ES
dc.rights Attribution (by) es_ES
dc.subject Data science automation es_ES
dc.subject Data wrangling es_ES
dc.subject Language models es_ES
dc.subject Machine learning pipelines es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Can language models automate data wrangling? es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10994-022-06259-9 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-094403-B-C32/ES/RAZONAMIENTO FORMAL PARA TECNOLOGIAS FACILITADORAS Y EMERGENTES/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F098//DEEPTRUST/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/952215/EU es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//INNEST%2F2021%2F317/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//PID2021-122830OB-C42/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/FLI//RFP2-152/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/DOD//HR00112120007/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MIT//COST-OMIZE/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica es_ES
dc.description.bibliographicCitation Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, MJ. (2023). Can language models automate data wrangling?. Machine Learning. 112(6):2053-2082. https://doi.org/10.1007/s10994-022-06259-9 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10994-022-06259-9 es_ES
dc.description.upvformatpinicio 2053 es_ES
dc.description.upvformatpfin 2082 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 112 es_ES
dc.description.issue 6 es_ES
dc.relation.pasarela S\488506 es_ES
dc.contributor.funder European Commission es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.contributor.funder Future of Life Institute es_ES
dc.contributor.funder U.S. Department of Defense es_ES
dc.contributor.funder Agencia Estatal de Investigación es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Massachusetts Institute of Technology es_ES
dc.contributor.funder Universitat Politècnica de València es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES
dc.description.references Ashok, P., & Nawaz, G. K. (2016). Outlier detection method on uci repository dataset by entropy based rough k-means. Defence Science Journal, 66(2), 113–121. es_ES
dc.description.references Bellmann, P., & Schwenker, F. (2020). Ordinal classification: Working definition and detection of ordinal structures. IEEE Access, 8, 164380–164391. https://doi.org/10.1109/ACCESS.2020.3021596 es_ES
dc.description.references Ben-Gal, I. (2005). Outlier detection. In Data mining and knowledge discovery handbook (pp. 131–146). Springer. es_ES
dc.description.references Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). FAccT ’21. es_ES
dc.description.references Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155. es_ES
dc.description.references Bhupatiraju, S., Singh, R., Mohamed, A. R., & Kohli, P. (2017). Deep API programmer: Learning to program with APIs. arXiv preprint arXiv:1704.04327. es_ES
dc.description.references BIG-bench collaboration. (2022). Beyond the imitation game: Measuring and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. https://github.com/google/BIG-bench/ es_ES
dc.description.references Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165. es_ES
dc.description.references Chen, Y., Dang, X., Peng, H., & Bart, H. L. (2008). Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 288–305. es_ES
dc.description.references Contreras-Ochando, L., Ferri, C., & Hernández-Orallo, J. (2019a). Automating common data science matrix transformations. In ECMLPKDD workshop on Automating Data Science. ECML-PKDD ’19. es_ES
dc.description.references Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M. J., & Katayama, S. (2019b). Automated data transformation with inductive programming and dynamic background knowledge. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2019. ECML-PKDD ’19. es_ES
dc.description.references Cropper, A., Tamaddoni, A., & Muggleton, S. H. (2015). Meta-interpretive learning of data transformation programs. In Inductive Logic Programming (pp. 46–59). es_ES
dc.description.references Das, K., & Schneider, J. (2007). Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 220–229). es_ES
dc.description.references Das, K., Schneider, J., & Neill, D. B. (2008). Anomaly pattern detection in categorical datasets. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169–176). es_ES
dc.description.references De Bie, T., De Raedt, L., Hernández-Orallo, J., Hoos, H. H., Smyth, P., & Williams, C. K. I. (2022). Automating data science: Prospects and challenges. Communications of the ACM, 65(3), 76–87. es_ES
dc.description.references Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. es_ES
dc.description.references Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml. es_ES
dc.description.references Ellis, K., & Gulwani, S. (2017). Learning to learn programs from examples: Going beyond program structure. In IJCAI (pp. 1638–1645). es_ES
dc.description.references Fernando, M. P., Cèsar, F., David, N., & José, H. O. (2021). Missing the missing values: The ugly duckling of fairness in machine learning. International Journal of Intelligent Systems, 36(7), 3217–3258. es_ES
dc.description.references Ferrari, A., & Russo, M. (2016). Introducing Microsoft Power BI. Microsoft Press. es_ES
dc.description.references Furche, T., Gottlob, G., Libkin, L., Orsi, G., & Paton, N. W. (2016). Data wrangling for big data. Challenges and opportunities. EDBT, 16, 473–478. es_ES
dc.description.references Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723. es_ES
dc.description.references García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1), 1–22. es_ES
dc.description.references Gulwani, S. (2011). Automating string processing in spreadsheets using input-output examples. In Procs. 38th Principles of Programming Languages (pp. 317–330). es_ES
dc.description.references Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. (2015). Inductive programming meets the real world. Communications of the ACM, 58(11), 90–99. es_ES
dc.description.references Ham, K. (2013). OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data. Journal of the Medical Library Association: JMLA, 101(3), 233. es_ES
dc.description.references He, Z., Xu, X., Huang, Z. J., & Deng, S. (2005). Fp-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems, 2(1), 103–118. es_ES
dc.description.references Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. In ICLR. es_ES
dc.description.references Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874. es_ES
dc.description.references Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., Demiralp, Ç., & Hidalgo, C. (2019). Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1500–1508). es_ES
dc.description.references Izacard, G., & Grave, E. (2020). Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282. es_ES
dc.description.references Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., & Ramírez-Quintana, M. J. (2021). Can language models automate data wrangling?. In ECML/PKDD Workshop on Automated Data Science (ADS2021). https://sites.google.com/view/autods. es_ES
dc.description.references Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 3363–3372). ACM. es_ES
dc.description.references Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 157–166). es_ES
dc.description.references Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786. es_ES
dc.description.references Nazabal, A., Williams, C. K., Colavizza, G., Smith, C. R., & Williams, A. (2020). Data engineering for data analytics: A classification of the issues, and case studies. arXiv preprint arXiv:2004.12929. es_ES
dc.description.references Noto, K., Brodley, C., & Slonim, D. (2012). Frac: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1), 109–133. es_ES
dc.description.references Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. es_ES
dc.description.references Petrova-Antonova, D., & Tancheva, R. (2020). Data cleaning: A case study with OpenRefine and Trifacta Wrangler. In International Conference on the Quality of Information and Communications Technology (pp. 32–40). Springer. es_ES
dc.description.references Porwal, U., & Mukund, S. (2017). Outlier detection by consistent data selection method. arXiv preprint arXiv:1712.04129. es_ES
dc.description.references Puri, R., & Catanzaro, B. (2019). Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165. es_ES
dc.description.references Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9. es_ES
dc.description.references Raman, V., & Hellerstein, J. M. (2001). Potter’s wheel: An interactive data cleaning system. In VLDB (Vol. 1, pp. 381–390). es_ES
dc.description.references Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., & Eccles, T. (2022). A generalist agent. arXiv preprint arXiv:2205.06175. es_ES
dc.description.references Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. es_ES
dc.description.references Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676. es_ES
dc.description.references Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical Journal, 28(4), 656–715. es_ES
dc.description.references Shi, Y., Li, W., & Sha, F. (2016). Metric learning for ordinal data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30). es_ES
dc.description.references Singh, R., & Gulwani, S. (2015). Predicting a correct program in programming by example. In International Conference on Computer Aided Verification (pp. 398–414). Springer. es_ES
dc.description.references Singh, R., & Gulwani, S. (2016). Transforming spreadsheet data types using examples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp. 343–356). es_ES
dc.description.references Sleeper, R. (2021). Tableau Desktop Pocket Reference. O’Reilly Media Inc. es_ES
dc.description.references Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., & Zhang, E. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990. es_ES
dc.description.references Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503. es_ES
dc.description.references Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging journey from the wild to the lake. In CIDR. es_ES
dc.description.references Trifacta (2022): Trifacta Wrangler. https://www.trifacta.com es_ES
dc.description.references Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. es_ES
dc.description.references Wei, J., Bosma, M. P., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned language models are zero-shot learners. https://openreview.net/forum?id=gEZrGCozdqR es_ES
dc.description.references Wu, B., Szekely, P., & Knoblock, C. A. (2012). Learning data transformation rules through examples: Preliminary results. In Information Integration on the Web (p. 8). es_ES
dc.description.references Xu, S., Semnani, S. J., Campagna, G., & Lam, M. S. (2020). AutoQA: From databases to QA semantic parsers with only synthetic training data. In EMNLP. es_ES
dc.description.references Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., Jiang, X., Yang, Z., Wang, K., Zhang, X., & Li, C. (2021). PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. es_ES
dc.description.references Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., & Tan, W. C. (2019). Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311. es_ES
dc.description.references Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). Designing effective sparse expert models. arXiv preprint arXiv:2202.08906. es_ES

