Setting Crunchbase for Data Science: Preprocessing, Data Integration and Feature Engineering
Authors
Ferrati, Francesco
Muffatto, Moreno
Handle
https://riunet.upv.es/handle/10251/148975
Bibliographic citation
Ferrati, F.; Muffatto, M. (2020). Setting Crunchbase for Data Science: Preprocessing, Data Integration and Feature Engineering. Editorial Universitat Politècnica de València. 221-229. https://doi.org/10.4995/CARMA2020.2020.11633
Abstract
In order to support equity investors in their decision-making process,
researchers are exploring the potential of machine learning algorithms to
predict the financial success of startup ventures. In this context, a key role is
played by the significance of the data used, which should reflect most of the
variables considered by investors in their screening and evaluation activity.
This paper provides a detailed description of the data management process
that can be followed to obtain such a dataset. Using Crunchbase as the main
data source, other databases have been integrated to enrich the information
content and support the feature engineering process. Specifically, the
following sources have been considered: USPTO PatentsView, Kauffman
Indicators of Entrepreneurship, Academic Ranking of World Universities, and the CB
Insights ranking of top investors. The final dataset contains the profiles of
138,637 US-based ventures founded between 2000 and 2019. For each
company the elements assessed by equity investors have been analyzed. Among
others, the following specific areas were considered for each company:
location, industry, founding team, intellectual property and funding round
history. Data related to each area have been formalized in a series of features
ready to be used in a machine learning context.
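
As a rough illustration of the kind of integration and feature engineering step the abstract describes, the following minimal Python sketch joins company records with patent and funding-round records and derives a few per-company features. The field names (org_id, country, founded_on, state, raised_usd), the sample records, and the build_features helper are all hypothetical; they are not the Crunchbase or PatentsView schema and not the authors' actual pipeline.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records; field names are illustrative assumptions only.
organizations = [
    {"org_id": "a1", "country": "USA", "founded_on": date(2012, 5, 1), "state": "CA"},
    {"org_id": "b2", "country": "USA", "founded_on": date(1998, 3, 9), "state": "NY"},
    {"org_id": "c3", "country": "ITA", "founded_on": date(2015, 1, 20), "state": None},
]
patents = [{"org_id": "a1"}, {"org_id": "a1"}, {"org_id": "b2"}]
funding_rounds = [
    {"org_id": "a1", "raised_usd": 1_000_000},
    {"org_id": "a1", "raised_usd": 4_000_000},
]


def build_features(organizations, patents, funding_rounds):
    """Return one feature dict per US-based company founded in 2000-2019."""
    # Aggregate the secondary sources by company identifier.
    patent_counts = Counter(p["org_id"] for p in patents)
    round_counts = Counter(r["org_id"] for r in funding_rounds)
    total_raised = Counter()
    for r in funding_rounds:
        total_raised[r["org_id"]] += r["raised_usd"]

    features = []
    for org in organizations:
        year = org["founded_on"].year
        # Keep only the population described in the abstract.
        if org["country"] != "USA" or not (2000 <= year <= 2019):
            continue
        oid = org["org_id"]
        features.append({
            "org_id": oid,
            "state": org["state"],
            "founded_year": year,
            "patent_count": patent_counts[oid],      # 0 if no patents matched
            "funding_round_count": round_counts[oid],
            "total_raised_usd": total_raised[oid],
        })
    return features


features = build_features(organizations, patents, funding_rounds)
```

In practice, matching companies across sources would need a more careful key than a shared identifier (for example, normalized company names), which this sketch does not attempt.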
Keywords
Web data, Internet data, Big data, QCA, PLS, SEM, Conference, Crunchbase, Startup, Investments, Feature engineering, Data mining, Machine learning
ISBN
9788490488324
DOI
10.4995/CARMA2020.2020.11633
Publisher
Editorial Universitat Politècnica de València