Abstract
Most methods for automatic document categorization are based on supervised learning
techniques and, consequently, require a large number of training instances. To tackle
this problem, this thesis proposes a new semi-supervised method for categorizing
documents that automatically extracts unlabeled examples from the Web and incorporates
them into the training set.
The unlabeled examples for the training set are selected by a method based on machine
learning. This incremental model selects only the best unlabeled examples in each
iteration. In some domains, this technique improves classification accuracy, especially
when labeled data are scarce. That is, the better the labeled examples represent the
category to which they belong, the better the results obtained with this method. The
method is independent of domain and language, and it is best suited to scenarios in
which there are not enough manually labeled training instances.
The experimental evaluation of the
method comprised three experiments on document categorization, both thematic (using
document collections with different characteristics, such as very few training examples
and a high degree of overlap) and non-thematic (authorship attribution). A fourth
experiment was carried out on the word sense disambiguation task. The results of each
of these experiments show the effectiveness of incorporating unlabeled data downloaded
from the Web into the training set.
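
The incremental selection described above can be illustrated with a minimal self-training sketch. It is not the implementation used in this thesis. The centroid-based cosine classifier, the margin-based confidence score, and the names `self_train`, `per_iteration`, and `iterations` are all assumptions chosen for brevity, and the unlabeled pool is a plain list standing in for text downloaded from the Web.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(labeled):
    """Sum the term counts of all examples in each category."""
    cents = {}
    for text, label in labeled:
        cents.setdefault(label, Counter()).update(vectorize(text))
    return cents


def predict(cents, text):
    """Return (label, confidence); confidence is the margin between the
    best and second-best category scores (an assumed heuristic)."""
    vec = vectorize(text)
    scores = sorted(((cosine(vec, c), label) for label, c in cents.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labeled, unlabeled, per_iteration=1, iterations=3):
    """Grow the training set with the most confidently classified
    unlabeled examples, a few per iteration."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        cents = centroids(labeled)  # retrain on the current training set
        scored = [(predict(cents, text), text) for text in pool]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        # Keep only the top candidates that the classifier can separate at all.
        chosen = [item for item in scored[:per_iteration] if item[0][1] > 0]
        if not chosen:
            break
        for (label, _confidence), text in chosen:
            labeled.append((text, label))
            pool.remove(text)
    return labeled


# Toy data, invented for illustration only.
seed = [("football match goal team", "sports"),
        ("election vote parliament law", "politics")]
pool = ["team wins the match", "parliament passes new law", "weather"]
enlarged = self_train(seed, pool)
```

In this toy run the two topical snippets are added to the training set with the predicted label, while the snippet that matches no category is left out, mirroring the idea of keeping only the best unlabeled examples at each iteration.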