Mostrar el registro sencillo del ítem
dc.contributor.author | Jaimovitch-López, Gonzalo | es_ES |
dc.contributor.author | Ferri Ramírez, César | es_ES |
dc.contributor.author | Hernández-Orallo, José | es_ES |
dc.contributor.author | Martínez-Plumed, Fernando | es_ES |
dc.contributor.author | Ramírez Quintana, María José | es_ES |
dc.date.accessioned | 2023-08-28T18:00:31Z | |
dc.date.available | 2023-08-28T18:00:31Z | |
dc.date.issued | 2023-06 | es_ES |
dc.identifier.issn | 0885-6125 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10251/195751 | |
dc.description.abstract | [EN] The automation of data science and other data manipulation processes depend on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is then an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue growing. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on their results and how they compare with specialised data wrangling systems and other tools. Our major finding is that they appear as a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided the users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome. | es_ES |
dc.description.sponsorship | Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, the MIT-Spain - INDITEX Sustainability Seed Fund under project COST-OMIZE, the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32 and PID2021-122830OB-C42, Generalitat Valenciana under PROMETEO/2019/098 and INNEST/2021/317, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR) and US DARPA HR00112120007 ReCOG-AI. AcknowledgementsWe thank Lidia Contreras for her help with the Data Wrangling Dataset Repository. We thank the anonymous reviewers from ECMLPKDD Workshop on Automating Data Science (ADS2021) and the anonymous reviewers of this special issue for their comments. | es_ES |
dc.language | Inglés | es_ES |
dc.publisher | Springer-Verlag | es_ES |
dc.relation.ispartof | Machine Learning | es_ES |
dc.rights | Reconocimiento (by) | es_ES |
dc.subject | Data science automation | es_ES |
dc.subject | Data wrangling | es_ES |
dc.subject | Language models | es_ES |
dc.subject | Machine learning pipelines | es_ES |
dc.subject.classification | LENGUAJES Y SISTEMAS INFORMATICOS | es_ES |
dc.title | Can language models automate data wrangling? | es_ES |
dc.type | Artículo | es_ES |
dc.identifier.doi | 10.1007/s10994-022-06259-9 | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-094403-B-C32/ES/RAZONAMIENTO FORMAL PARA TECNOLOGIAS FACILITADORAS Y EMERGENTES/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F098//DEEPTRUST/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/EC/H2020/952215/EU | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/GVA//INNEST%2F2021%2F317/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MINECO//PID2021-122830OB-C42/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/FLI//RFP2-152/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/DOD//HR00112120007/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MIT//COST-OMIZE/ | es_ES |
dc.rights.accessRights | Abierto | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica | es_ES |
dc.description.bibliographicCitation | Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, MJ. (2023). Can language models automate data wrangling?. Machine Learning. 112(6):2053-2082. https://doi.org/10.1007/s10994-022-06259-9 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | https://doi.org/10.1007/s10994-022-06259-9 | es_ES |
dc.description.upvformatpinicio | 2053 | es_ES |
dc.description.upvformatpfin | 2082 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 112 | es_ES |
dc.description.issue | 6 | es_ES |
dc.relation.pasarela | S\488506 | es_ES |
dc.contributor.funder | European Commission | es_ES |
dc.contributor.funder | Generalitat Valenciana | es_ES |
dc.contributor.funder | Future of Life Institute | es_ES |
dc.contributor.funder | U.S. Department of Defense | es_ES |
dc.contributor.funder | Agencia Estatal de Investigación | es_ES |
dc.contributor.funder | European Regional Development Fund | es_ES |
dc.contributor.funder | Massachusetts Institute of Technology | es_ES |
dc.contributor.funder | Universitat Politècnica de València | es_ES |
dc.contributor.funder | Ministerio de Economía y Competitividad | es_ES |
dc.description.references | Ashok, P., & Nawaz, G. K. (2016). Outlier detection method on uci repository dataset by entropy based rough k-means. Defence Science Journal, 66(2), 113–121. | es_ES |
dc.description.references | Bellmann, P., & Schwenker, F. (2020). Ordinal classification: Working definition and detection of ordinal structures. IEEE Access, 8, 164380–164391. https://doi.org/10.1109/ACCESS.2020.3021596 | es_ES |
dc.description.references | Ben-Gal, I. (2005). Outlier detection. In Data mining and knowledge discovery handbook (pp. 131–146). Springer. | es_ES |
dc.description.references | Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). FAccT ’21. | es_ES |
dc.description.references | Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155. | es_ES |
dc.description.references | Bhupatiraju, S., Singh, R., Mohamed, A. R., & Kohli, P. (2017). Deep API programmer: Learning to program with APIs. arXiv preprint arXiv:1704.04327. | es_ES |
dc.description.references | BIG-bench collaboration. (2022). Beyond the imitation game: Measuring and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. https://github.com/google/BIG-bench/ | es_ES |
dc.description.references | Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165. | es_ES |
dc.description.references | Chen, Y., Dang, X., Peng, H., & Bart, H. L. (2008). Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 288–305. | es_ES |
dc.description.references | Contreras-Ochando, L., Ferri, C., & Hernández-Orallo, J. (2019a). Automating common data science matrix transformations. In ECMLPKDD workshop on Automating Data Science. ECML-PKDD ’19. | es_ES |
dc.description.references | Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M. J., & Katayama, S. (2019b). Automated data transformation with inductive programming and dynamic background knowledge. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2019. ECML-PKDD ’19. | es_ES |
dc.description.references | Cropper, A., Tamaddoni, A., & Muggleton, S. H. (2015). Meta-interpretive learning of data transformation programs. In Inductive Logic Programming (pp. 46–59). | es_ES |
dc.description.references | Das, K., & Schneider, J. (2007). Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 220–229). | es_ES |
dc.description.references | Das, K., Schneider, J., & Neill, D. B. (2008). Anomaly pattern detection in categorical datasets. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169–176). | es_ES |
dc.description.references | De Bie, T., De Raedt, L., Hernández-Orallo, J., Hoos, H. H., Smyth, P., & Williams, C. K. I. (2022). Automating data science: Prospects and challenges. Communications of the ACM, 65(3), 76–87. | es_ES |
dc.description.references | Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. | es_ES |
dc.description.references | Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml. | es_ES |
dc.description.references | Ellis, K., & Gulwani, S. (2017). Learning to learn programs from examples: Going beyond program structure. In IJCAI (pp. 1638–1645). | es_ES |
dc.description.references | Fernando, M. P., Cèsar, F., David, N., & José, H. O. (2021). Missing the missing values: The ugly duckling of fairness in machine learning. International Journal of Intelligent Systems, 36(7), 3217–3258. | es_ES |
dc.description.references | Ferrari, A., & Russo, M. (2016). Introducing Microsoft Power BI. Microsoft Press. | es_ES |
dc.description.references | Furche, T., Gottlob, G., Libkin, L., Orsi, G., & Paton, N. W. (2016). Data wrangling for big data. Challenges and opportunities. EDBT, 16, 473–478. | es_ES |
dc.description.references | Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723. | es_ES |
dc.description.references | García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1), 1–22. | es_ES |
dc.description.references | Gulwani, S. (2011). Automating string processing in spreadsheets using input-output examples. In Procs. 38th Principles of Programming Languages (pp. 317–330). | es_ES |
dc.description.references | Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. (2015). Inductive programming meets the real world. Communications of the ACM, 58(11), 90–99. | es_ES |
dc.description.references | Ham, K. (2013). OpenRefine (version 2.5). http://openrefine.org.free/ Open-source tool for cleaning and transforming data. Journal of the Medical Library Association: JMLA, 101 (3), 233. | es_ES |
dc.description.references | He, Z., Xu, X., Huang, Z. J., & Deng, S. (2005). Fp-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems, 2(1), 103–118. | es_ES |
dc.description.references | Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. In ICLR. | es_ES |
dc.description.references | Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. In CoRR. arxiv:2103.03874. | es_ES |
dc.description.references | Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., Demiralp, Ç., & Hidalgo, C. (2019). Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1500–1508). | es_ES |
dc.description.references | Izacard, G., & Grave, E. (2020). Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282. | es_ES |
dc.description.references | Jaimovitch-Lopez, G., Ferri, C., Hernandez-Orallo, J., Martinez-Plumed, F., & Ramirez-Quintana, M. J. (2021). Can language models automate data wrangling?. In ECML/PKDD Workshop on Automated Data Science (ADS2021). https://sites.google.com/view/autods. | es_ES |
dc.description.references | Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 3363–3372). ACM. | es_ES |
dc.description.references | Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 157–166). | es_ES |
dc.description.references | Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786. | es_ES |
dc.description.references | Nazabal, A., Williams, C. K., Colavizza, G., Smith, C. R., & Williams, A. (2020). Data engineering for data analytics: A classification of the issues, and case studies. arXiv preprint arXiv:2004.12929. | es_ES |
dc.description.references | Noto, K., Brodley, C., & Slonim, D. (2012). Frac: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1), 109–133. | es_ES |
dc.description.references | Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. | es_ES |
dc.description.references | Petrova-Antonova, D., & Tancheva, R. (2020). Data cleaning: A case study with OpenRefine and Trifacta Wrangler. In International Conference on the Quality of Information and Communications Technology (pp. 32–40). Springer. | es_ES |
dc.description.references | Porwal, U., & Mukund, S. (2017). Outlier detection by consistent data selection method. arXiv preprint arXiv:1712.04129. | es_ES |
dc.description.references | Puri, R., & Catanzaro, B. (2019). Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165. | es_ES |
dc.description.references | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9. | es_ES |
dc.description.references | Raman, V., & Hellerstein, J. M. (2001). Potter’s wheel: An interactive data cleaning system. In VLDB (Vol. 1, pp. 381–390). | es_ES |
dc.description.references | Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., & Eccles, T. (2022). A generalist agent. arXiv preprint arXiv:2205.06175. | es_ES |
dc.description.references | Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. | es_ES |
dc.description.references | Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676. | es_ES |
dc.description.references | Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical Journal, 28(4), 656–715. | es_ES |
dc.description.references | Shi, Y., Li, W., & Sha, F. (2016). Metric learning for ordinal data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30). | es_ES |
dc.description.references | Singh, R., & Gulwani, S. (2015). Predicting a correct program in programming by example. In International Conference on Computer Aided Verification (pp. 398–414). Springer. | es_ES |
dc.description.references | Singh, R., & Gulwani, S. (2016). Transforming spreadsheet data types using examples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp. 343–356). | es_ES |
dc.description.references | Sleeper, R. (2021). Tableau Desktop Pocket Reference. O’Reilly Media Inc. | es_ES |
dc.description.references | Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., & Zhang, E. (2022). Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990. | es_ES |
dc.description.references | Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503. | es_ES |
dc.description.references | Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging journey from the wild to the lake. In CIDR. | es_ES |
dc.description.references | Trifacta (2022): Trifacta Wrangler. https://www.trifacta.com | es_ES |
dc.description.references | Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. | es_ES |
dc.description.references | Wei, J., Bosma, M. P., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned language models are zero-shot learners. https://openreview.net/forum?id=gEZrGCozdqR | es_ES |
dc.description.references | Wu, B., Szekely, P., & Knoblock, C. A. (2012). Learning data transformation rules through examples: Preliminary results. In Information Integration on the Web (p. 8). | es_ES |
dc.description.references | Xu, S., Semnani, S. J., Campagna, G., & Lam, M. S. (2020). AutoQA: From databases to QA semantic parsers with only synthetic training data. In EMNLP. | es_ES |
dc.description.references | Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., Jiang, X., Yang, Z., Wang, K., Zhang, X., & Li, C. (2021). Pangu-$$\alpha$$: Large-scale autoregressive pretrained chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. | es_ES |
dc.description.references | Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., & Tan, W. C. (2019). Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311. | es_ES |
dc.description.references | Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). Designing effective sparse expert models. arXiv preprint arXiv:2202.08906. | es_ES |