Abstract
The development of predictive mathematical models of different types
of phenomena is one of the main applications of Data Mining
Techniques. The simple methods available for the development of
prediction models frequently reduce the complexity of the problem at
the price of sacrificing precision in the answer. In view of this, Data
Mining techniques can be applied which, taking advantage of the vast
amount of data available, explore the relationships and correlations
between the different descriptive variables of a phenomenon and a
series of observations thereof, looking for patterns of behaviour from
which a prediction model can be built.
This is the case in the modelling of air pollutants. Atmospheric
pollution is a phenomenon that behaves in an extremely nonlinear and
multivariate way, and its study requires large-scale data matrices,
which makes complex tools for data analysis and management necessary.
Sometimes several methods are combined, some objective and some
somewhat subjective, seeking a balance between the strengths and
weaknesses of the different tools. This also involves evaluating (often
by trial and error) different prediction horizons, the presence or
absence of the variables involved, the amount of data, different
subsets of the data, and so on.
In the specific case of this thesis, the developed prediction models
focus on predicting the average value of Fine Particulate Matter
(PM2.5) present in breathable air 8 hours in advance, and the Maximum
Tropospheric Ozone (O3) 24 hours in advance.
A range of prediction techniques was employed, starting with simple
parametric tools such as Persistence and Multivariable Linear
Modelling, together with the semi-parametric technique known as
“Ridge regression”, and continuing with nonparametric tools such as
Artificial Neural Networks (ANN) and Support Vector Machines (SVM).
Given our prior knowledge of the highly nonlinear nature of the
modelled pollutants, the parametric techniques served to establish
upper bounds on the prediction error and to provide important
benchmarks for the rest of the developed models.
One significant result of the work was the achievement of better
prediction models than those available in the literature, by applying
Artificial Neural Networks such as Multi-Layer Perceptron (MLP),
Square Multi-Layer Perceptron (SMLP), Radial Basis Function (RBF) and
Elman networks, as well as Support Vector Machines (SVM).