
Can language models automate data wrangling?

RiuNet: Institutional Repository of the Universitat Politècnica de València



Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, MJ. (2023). Can language models automate data wrangling?. Machine Learning. 112(6):2053-2082. https://doi.org/10.1007/s10994-022-06259-9

Please use this identifier to cite or link to this item: http://hdl.handle.net/10251/195751


Item metadata

Title: Can language models automate data wrangling?
Author: Jaimovitch-López, Gonzalo; Ferri Ramírez, César; Hernández-Orallo, José; Martínez-Plumed, Fernando; Ramírez Quintana, María José
UPV entity: Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica
Date issued:
Abstract:
[EN] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such ...
Keywords: Data science automation, Data wrangling, Language models, Machine learning pipelines
Usage rights: Attribution (CC BY)
Source:
Machine Learning (ISSN: 0885-6125)
DOI: 10.1007/s10994-022-06259-9
Publisher:
Springer-Verlag
Publisher's version: https://doi.org/10.1007/s10994-022-06259-9
Project code:
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-094403-B-C32/ES/RAZONAMIENTO FORMAL PARA TECNOLOGIAS FACILITADORAS Y EMERGENTES/
info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F098//DEEPTRUST/
info:eu-repo/grantAgreement/EC/H2020/952215/EU
info:eu-repo/grantAgreement/GVA//INNEST%2F2021%2F317/
info:eu-repo/grantAgreement/MINECO//PID2021-122830OB-C42/
info:eu-repo/grantAgreement/FLI//RFP2-152/
info:eu-repo/grantAgreement/DOD//HR00112120007/
info:eu-repo/grantAgreement/MIT//COST-OMIZE/
Acknowledgements:
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, the MIT-Spain - INDITEX Sustainability Seed Fund under ...
Type: Article
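The abstract and keywords frame data wrangling automation as a few-shot task for a language model: show it a handful of solved input-output pairs and ask it to complete a new one. The following is a minimal, hypothetical Python sketch of that prompt construction, assuming a date-reformatting task of the kind such work typically studies; query_language_model is a placeholder for whichever completion API would be used, not a function from the paper.

# Minimal sketch: casting a data wrangling task (date normalisation)
# as a few-shot text-completion prompt. `query_language_model` is a
# hypothetical stand-in for any LLM completion API and is not defined here.

def build_prompt(examples, query):
    """Serialise solved input-output pairs plus one unsolved input."""
    lines = [f"Input: {src}\nOutput: {tgt}" for src, tgt in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

examples = [
    ("25 March 1983", "25/03/1983"),
    ("03 November 2020", "03/11/2020"),
]

prompt = build_prompt(examples, "17 August 2015")
print(prompt)
# completion = query_language_model(prompt)  # expected: "17/08/2015"

Running the sketch prints the serialised prompt; the commented-out call marks where a real model completion would be requested.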


