
Can language models automate data wrangling?

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia



dc.contributor.author Jaimovitch-López, Gonzalo es_ES
dc.contributor.author Ferri Ramírez, César es_ES
dc.contributor.author Hernández-Orallo, José es_ES
dc.contributor.author Martínez-Plumed, Fernando es_ES
dc.contributor.author Ramírez Quintana, María José es_ES
dc.date.accessioned 2023-08-28T18:00:31Z
dc.date.available 2023-08-28T18:00:31Z
dc.date.issued 2023-06 es_ES
dc.identifier.issn 0885-6125 es_ES
dc.identifier.uri http://hdl.handle.net/10251/195751
dc.description.abstract [EN] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is therefore an important research question whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on the results, and how the models compare with specialised data wrangling systems and other tools. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome. es_ES
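The few-shot regime described in the abstract can be sketched as follows. This is a minimal illustration of how a data wrangling task (here, hypothetical date-normalisation pairs) might be serialised into a completion prompt for a GPT-style model; the prompt layout, labels and example pairs are assumptions, not the paper's actual prompts or settings.

```python
def build_fewshot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Serialise input-output example pairs into a few-shot completion prompt.

    The returned string ends with an unanswered output label, so a language
    model asked to continue the text produces the wrangled value for `query`.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{input_label}: {src}")
        lines.append(f"{output_label}: {tgt}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the model completes from here
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical date-format examples (DD/MM/YYYY -> ISO 8601).
    shots = [
        ("25/03/1983", "1983-03-25"),
        ("07/12/1990", "1990-12-07"),
    ]
    print(build_fewshot_prompt(shots, "12/06/2021"))
```

In a pipeline, the resulting string would be sent to a completion endpoint and the first line of the continuation taken as the transformed value; the number of example pairs (the "shots") is the knob compared in the paper's few-shot experiments.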
dc.description.sponsorship Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, the MIT-Spain - INDITEX Sustainability Seed Fund under project COST-OMIZE, the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32 and PID2021-122830OB-C42, Generalitat Valenciana under PROMETEO/2019/098 and INNEST/2021/317, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR) and US DARPA HR00112120007 ReCOG-AI. Acknowledgements: We thank Lidia Contreras for her help with the Data Wrangling Dataset Repository. We thank the anonymous reviewers from the ECMLPKDD Workshop on Automating Data Science (ADS2021) and the anonymous reviewers of this special issue for their comments. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Machine Learning es_ES
dc.rights Attribution (by) es_ES
dc.subject Data science automation es_ES
dc.subject Data wrangling es_ES
dc.subject Language models es_ES
dc.subject Machine learning pipelines es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Can language models automate data wrangling? es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10994-022-06259-9 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-094403-B-C32/ES/RAZONAMIENTO FORMAL PARA TECNOLOGIAS FACILITADORAS Y EMERGENTES/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F098//DEEPTRUST/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/952215/EU es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GVA//INNEST%2F2021%2F317/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//PID2021-122830OB-C42/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/FLI//RFP2-152/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/DOD//HR00112120007/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MIT//COST-OMIZE/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica es_ES
dc.description.bibliographicCitation Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, MJ. (2023). Can language models automate data wrangling?. Machine Learning. 112(6):2053-2082. https://doi.org/10.1007/s10994-022-06259-9 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10994-022-06259-9 es_ES
dc.description.upvformatpinicio 2053 es_ES
dc.description.upvformatpfin 2082 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 112 es_ES
dc.description.issue 6 es_ES
dc.relation.pasarela S\488506 es_ES
dc.contributor.funder European Commission es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.contributor.funder Future of Life Institute es_ES
dc.contributor.funder U.S. Department of Defense es_ES
dc.contributor.funder Agencia Estatal de Investigación es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Massachusetts Institute of Technology es_ES
dc.contributor.funder Universitat Politècnica de València es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES
dc.description.references Ashok, P., & Nawaz, G. K. (2016). Outlier detection method on uci repository dataset by entropy based rough k-means. Defence Science Journal, 66(2), 113–121. es_ES
dc.description.references Bellmann, P., & Schwenker, F. (2020). Ordinal classification: Working definition and detection of ordinal structures. IEEE Access, 8, 164380–164391. https://doi.org/10.1109/ACCESS.2020.3021596 es_ES
dc.description.references Ben-Gal, I. (2005). Outlier detection. In Data mining and knowledge discovery handbook (pp. 131–146). Springer. es_ES
dc.description.references Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). FAccT ’21. es_ES
dc.description.references Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155. es_ES
dc.description.references Bhupatiraju, S., Singh, R., Mohamed, A. R., & Kohli, P. (2017). Deep API programmer: Learning to program with APIs. arXiv preprint arXiv:1704.04327. es_ES
dc.description.references BIG-bench collaboration. (2022). Beyond the imitation game: Measuring and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. https://github.com/google/BIG-bench/ es_ES
dc.description.references Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165. es_ES
dc.description.references Chen, Y., Dang, X., Peng, H., & Bart, H. L. (2008). Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 288–305. es_ES
dc.description.references Contreras-Ochando, L., Ferri, C., & Hernández-Orallo, J. (2019a). Automating common data science matrix transformations. In ECMLPKDD workshop on Automating Data Science. ECML-PKDD ’19. es_ES
dc.description.references Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M. J., & Katayama, S. (2019b). Automated data transformation with inductive programming and dynamic background knowledge. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2019. ECML-PKDD ’19. es_ES
dc.description.references Cropper, A., Tamaddoni, A., & Muggleton, S. H. (2015). Meta-interpretive learning of data transformation programs. In Inductive Logic Programming (pp. 46–59). es_ES
dc.description.references Das, K., & Schneider, J. (2007). Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 220–229). es_ES
dc.description.references Das, K., Schneider, J., & Neill, D. B. (2008). Anomaly pattern detection in categorical datasets. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169–176). es_ES
dc.description.references De Bie, T., De Raedt, L., Hernández-Orallo, J., Hoos, H. H., Smyth, P., & Williams, C. K. I. (2022). Automating data science: Prospects and challenges. Communications of the ACM, 65(3), 76–87. es_ES
dc.description.references Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. es_ES
dc.description.references Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml. es_ES
dc.description.references Ellis, K., & Gulwani, S. (2017). Learning to learn programs from examples: Going beyond program structure. In IJCAI (pp. 1638–1645). es_ES
dc.description.references Fernando, M. P., Cèsar, F., David, N., & José, H. O. (2021). Missing the missing values: The ugly duckling of fairness in machine learning. International Journal of Intelligent Systems, 36(7), 3217–3258. es_ES
dc.description.references Ferrari, A., & Russo, M. (2016). Introducing Microsoft Power BI. Microsoft Press. es_ES
dc.description.references Furche, T., Gottlob, G., Libkin, L., Orsi, G., & Paton, N. W. (2016). Data wrangling for big data. Challenges and opportunities. EDBT, 16, 473–478. es_ES
dc.description.references Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723. es_ES
dc.description.references García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1), 1–22. es_ES
dc.description.references Gulwani, S. (2011). Automating string processing in spreadsheets using input-output examples. In Procs. 38th Principles of Programming Languages (pp. 317–330). es_ES
dc.description.references Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. (2015). Inductive programming meets the real world. Communications of the ACM, 58(11), 90–99. es_ES
dc.description.references Ham, K. (2013). OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data. Journal of the Medical Library Association: JMLA, 101(3), 233. es_ES
dc.description.references He, Z., Xu, X., Huang, Z. J., & Deng, S. (2005). Fp-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems, 2(1), 103–118. es_ES
dc.description.references Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. In ICLR. es_ES
dc.description.references Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874. es_ES
dc.description.references Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., Demiralp, Ç., & Hidalgo, C. (2019). Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1500–1508). es_ES
dc.description.references Izacard, G., & Grave, E. (2020). Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282. es_ES
dc.description.references Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., & Ramírez-Quintana, M. J. (2021). Can language models automate data wrangling?. In ECML/PKDD Workshop on Automated Data Science (ADS2021). https://sites.google.com/view/autods. es_ES
dc.description.references Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 3363–3372). ACM. es_ES
dc.description.references Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 157–166). es_ES
dc.description.references Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786. es_ES
dc.description.references Nazabal, A., Williams, C. K., Colavizza, G., Smith, C. R., & Williams, A. (2020). Data engineering for data analytics: A classification of the issues, and case studies. arXiv preprint arXiv:2004.12929. es_ES
dc.description.references Noto, K., Brodley, C., & Slonim, D. (2012). Frac: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1), 109–133. es_ES
dc.description.references Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. es_ES
dc.description.references Petrova-Antonova, D., & Tancheva, R. (2020). Data cleaning: A case study with OpenRefine and Trifacta Wrangler. In International Conference on the Quality of Information and Communications Technology (pp. 32–40). Springer. es_ES
dc.description.references Porwal, U., & Mukund, S. (2017). Outlier detection by consistent data selection method. arXiv preprint arXiv:1712.04129. es_ES
dc.description.references Puri, R., & Catanzaro, B. (2019). Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165. es_ES
dc.description.references Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9. es_ES
dc.description.references Raman, V., & Hellerstein, J. M. (2001). Potter’s wheel: An interactive data cleaning system. In VLDB (Vol. 1, pp. 381–390). es_ES
dc.description.references Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., & Eccles, T. (2022). A generalist agent. arXiv preprint arXiv:2205.06175. es_ES
dc.description.references Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. es_ES
dc.description.references Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676. es_ES
dc.description.references Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical Journal, 28(4), 656–715. es_ES
dc.description.references Shi, Y., Li, W., & Sha, F. (2016). Metric learning for ordinal data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30). es_ES
dc.description.references Singh, R., & Gulwani, S. (2015). Predicting a correct program in programming by example. In International Conference on Computer Aided Verification (pp. 398–414). Springer. es_ES
dc.description.references Singh, R., & Gulwani, S. (2016). Transforming spreadsheet data types using examples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp. 343–356). es_ES
dc.description.references Sleeper, R. (2021). Tableau Desktop Pocket Reference. O’Reilly Media Inc. es_ES
dc.description.references Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., & Zhang, E. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990. es_ES
dc.description.references Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503. es_ES
dc.description.references Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging journey from the wild to the lake. In CIDR. es_ES
dc.description.references Trifacta (2022): Trifacta Wrangler. https://www.trifacta.com es_ES
dc.description.references Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. es_ES
dc.description.references Wei, J., Bosma, M. P., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned language models are zero-shot learners. https://openreview.net/forum?id=gEZrGCozdqR es_ES
dc.description.references Wu, B., Szekely, P., & Knoblock, C. A. (2012). Learning data transformation rules through examples: Preliminary results. In Information Integration on the Web (p. 8). es_ES
dc.description.references Xu, S., Semnani, S. J., Campagna, G., & Lam, M. S. (2020). AutoQA: From databases to QA semantic parsers with only synthetic training data. In EMNLP. es_ES
dc.description.references Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., Jiang, X., Yang, Z., Wang, K., Zhang, X., & Li, C. (2021). PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. es_ES
dc.description.references Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., & Tan, W. C. (2019). Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311. es_ES
dc.description.references Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). Designing effective sparse expert models. arXiv preprint arXiv:2202.08906. es_ES

