Setting Crunchbase for Data Science: Preprocessing, Data Integration and Feature Engineering

Handle

https://riunet.upv.es/handle/10251/148975

Bibliographic citation

Ferrati, F.; Muffatto, M. (2020). Setting Crunchbase for Data Science: Preprocessing, Data Integration and Feature Engineering. Editorial Universitat Politècnica de València. 221-229. https://doi.org/10.4995/CARMA2020.2020.11633

Abstract

To support equity investors in their decision-making process, researchers are exploring the potential of machine learning algorithms to predict the financial success of startup ventures. In this context, a key role is played by the significance of the data used, which should reflect most of the variables that investors consider in their screening and evaluation activity. This paper provides a detailed description of a data management process that can be followed to obtain such a dataset. Crunchbase serves as the main data source, and other databases have been integrated with it to enrich the information content and support the feature engineering process. Specifically, the following sources have been considered: USPTO PatentsView, the Kauffman Indicators of Entrepreneurship, the Academic Ranking of World Universities, and the CB Insights ranking of top investors. The final dataset contains the profiles of 138,637 US-based ventures founded between 2000 and 2019. For each company, the elements assessed by equity investors have been analyzed, covering, among others, the following areas: location, industry, founding team, intellectual property, and funding round history. The data related to each area have been formalized as a series of features ready to be used in a machine learning context.
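As a rough illustration of the kind of integration and feature engineering the abstract describes, the sketch below joins a company table with patent and funding-round records and derives a few per-company features. It is not the authors' pipeline: all records are made up, and the field names (`company_id`, `raised_usd`, `n_patents`, and so on) and the `build_features` helper are hypothetical, not the actual Crunchbase or PatentsView schemas.

```python
from collections import Counter

# Made-up records; field names are illustrative assumptions,
# not the real Crunchbase or PatentsView schemas.
companies = [
    {"company_id": "c1", "name": "Alpha", "state": "CA", "founded_year": 2012},
    {"company_id": "c2", "name": "Beta", "state": "NY", "founded_year": 2018},
]
patents = [
    {"patent_id": "p1", "company_id": "c1"},
    {"patent_id": "p2", "company_id": "c1"},
]
funding_rounds = [
    {"company_id": "c1", "raised_usd": 1_000_000},
    {"company_id": "c1", "raised_usd": 5_000_000},
    {"company_id": "c2", "raised_usd": 250_000},
]


def build_features(companies, patents, funding_rounds):
    """Join the three sources on company_id and derive simple features."""
    # Count patents and funding rounds per company.
    patent_counts = Counter(p["company_id"] for p in patents)
    round_counts = Counter(r["company_id"] for r in funding_rounds)

    # Sum the amount raised per company across all rounds.
    total_raised = Counter()
    for r in funding_rounds:
        total_raised[r["company_id"]] += r["raised_usd"]

    # Build one feature row per company; a Counter returns 0 for
    # companies with no patents or rounds, which acts as a left join.
    features = []
    for c in companies:
        cid = c["company_id"]
        features.append({
            "company_id": cid,
            "state": c["state"],
            "founded_year": c["founded_year"],
            "n_patents": patent_counts[cid],
            "n_rounds": round_counts[cid],
            "total_raised_usd": total_raised[cid],
        })
    return features


features = build_features(companies, patents, funding_rounds)
```

Keeping every company in the output, even those without patents or funding rounds, mirrors a left join on the main source, so that missing enrichment data shows up as zero counts rather than dropped rows.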

Keywords

Web data, Internet data, Big data, QCA, PLS, SEM, Conference, Crunchbase, Startup, Investments, Feature engineering, Data mining, Machine learning

ISBN

9788490488324

DOI

10.4995/CARMA2020.2020.11633

Publisher

Editorial Universitat Politècnica de València