Abstract:
|
[EN] Machine learning has become a trending topic in recent years and is
now one of the most in-demand careers in computer science. This
growth has led to more complex models capable of driving a car or
detecting cancer; however, these model improvements are also due
to improvements in computational power. In this study we investigate
a data exploration technique for creating synthetic data, a field
of machine learning that has not seen as much improvement in
recent years. Our project stems from an industrial process where data is a
valuable asset: this process has both computational power and powerful
models but struggles with the availability of data. In response,
a model for generating data is proposed, aiming to compensate for the lack
of data during data exploration and training in this industrial process.
This model consists of a Hidden Markov Model whose states represent
the different distributions that the data follows; data is created by traveling
through these states with an algorithm that uses the prior distribution
of these states in a Dirichlet distribution.
The method used to infer the distributions from the given data and build
this Hidden Markov Model is explained, along with
the technique used to travel between states. Results are presented
showing how the distribution inference performed and how well the synthetic
data reproduces the original, with special attention to the reproduction
of specific features of the original data. To get a better perspective
on the data we created, we manipulated the states of our model,
creating data from all of the states or only from the states with lower prior
probability. Results showed that the model is capable of creating data
similar to the real data, but it struggled with data containing a small number
of significant outliers. In conclusion, a model for creating reliable data
has been introduced, along with a list of possible improvements.
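
The generation scheme summarized above can be illustrated with a minimal sketch. It is not the author's implementation: it assumes each state holds a Gaussian (the abstract does not name the distribution families), that the prior over states is expressed as Dirichlet concentration values, and that a fresh probability vector is drawn from that Dirichlet at every step to choose the next state. The names `sample_dirichlet`, `generate`, `states`, and `prior_counts` are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) using normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate(states, prior_counts, n_samples, seed=0):
    """Generate synthetic values by moving between states.

    states: list of (mean, std) pairs, one Gaussian per state (assumed family).
    prior_counts: positive Dirichlet concentration per state (assumed to reflect
        how often each state appeared in the original data).
    """
    rng = random.Random(seed)
    path, values = [], []
    for _ in range(n_samples):
        # Sample state probabilities from the Dirichlet prior, then pick a state.
        weights = sample_dirichlet(prior_counts, rng)
        state = rng.choices(range(len(states)), weights=weights, k=1)[0]
        mean, std = states[state]
        path.append(state)
        # Emit one value from the distribution attached to the chosen state.
        values.append(rng.gauss(mean, std))
    return path, values


# Hypothetical three-state model: two common regimes and one rare one.
states = [(0.0, 1.0), (10.0, 2.0), (50.0, 5.0)]
prior_counts = [60.0, 35.0, 5.0]
path, values = generate(states, prior_counts, n_samples=1000)
```

Under these assumptions, passing equal values in `prior_counts`, or raising the values of the rare states, would correspond roughly to the experiments that create data from all of the states or from the states with lower prior probability.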
[ES] Nowadays a large number of companies are integrating machine
learning techniques into their business model. These techniques
require large amounts of data, which are not always available
to every company. To address this real problem,
we propose developing a synthetic data
generator. This project studies a reliable method for generating
synthetic data from an existing data collection, such that
the synthetic data resembles the given
collection. To do so, we will analyze the statistical distributions that
the given data collection follows and estimate their parameters to
build a hidden Markov model in which each state holds the
parameters of one distribution, inferred from the given
data collection. Once the hidden Markov model has been derived,
a synthetic data generation algorithm will be developed that
moves between the states of the trained Markov model.
Finally, we will analyze the validity of this synthetic data with an
anomalous-behavior detector based on support vector
machines and/or recurrent neural networks.
|