Abstract

The development of predictive mathematical models of different types
of phenomena is one of the main applications of data mining
techniques. The simple methods available for developing prediction
models frequently reduce the complexity of the problem at the price of
sacrificing precision in the answer. Data mining techniques offer an
alternative: drawing on the vast amount of available data, they explore
the relationships and correlations between the descriptive variables of
a phenomenon and a series of observations of it, looking for patterns of
behaviour from which a prediction model can be built.

This is the case in the modelling of air pollutants. Atmospheric
pollution is a phenomenon that behaves in an extremely nonlinear and
multivariate way, and its study requires large data matrices, which
makes complex tools for data analysis and management necessary.

Sometimes several methods are used together, some objective and some
somewhat subjective, seeking a balance between the strengths and
weaknesses of the different tools. Different prediction horizons are
also evaluated (often by trial and error), varying the presence or
absence of the variables involved, the amount of data, different
groupings of the data, etc.

In the specific case of this thesis, the prediction models developed
focus on predicting the average value of fine particulate matter
(PM2.5) in breathable air 8 hours ahead, and the maximum tropospheric
ozone (O3) 24 hours ahead.

A range of prediction techniques was employed, starting with simple
parametric tools such as persistence and multivariable linear
modelling, together with the semi-parametric technique known as
“ridge regression”, and nonparametric tools such as Artificial Neural
Networks (ANN) and Support Vector Machines (SVM).

Given prior knowledge of the highly nonlinear nature of the modelled
pollutants, the parametric techniques served to establish upper bounds
on the prediction error and to provide important comparative
references for the rest of the models developed.

One significant result of the work was to obtain better prediction
models than those available in the literature, by applying Artificial
Neural Network tools such as the Multi-Layer Perceptron (MLP), Square
Multi-Layer Perceptron (SMLP), Radial Basis Function (RBF) and Elman
networks, as well as Support Vector Machines (SVM).