Abstract
Most methods for automatic document categorization are based on supervised learning techniques and consequently require a large number of training instances. To tackle this problem, this thesis proposes a new semi-supervised method for categorizing documents, which automatically extracts unlabeled examples from the Web and incorporates them into the training set. The unlabeled examples added to the training set are selected by a method based on machine learning. This incremental model admits only the best unlabeled examples in each iteration. In some domains this technique improves classification accuracy, especially when labeled data are scarce. In particular, the more closely the labeled examples are related to the category to which they belong, the better the results obtained with this method. The method is independent of domain and language, and it is best suited to scenarios in which there are not enough manually labeled training instances. The method was evaluated in three experiments on document categorization, both thematic (using document collections with different characteristics, such as very few training examples and a high degree of overlap) and non-thematic (authorship attribution). A fourth experiment addressed the word sense disambiguation task. The results of each of these experiments show the effectiveness of incorporating unlabeled data downloaded from the Web into the training set.
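To make the incremental selection step concrete, the following is a minimal sketch of a generic self-training loop, not the implementation used in this thesis. Several parts are assumptions made only for illustration: the nearest-centroid classifier over bag-of-words counts, the confidence measure (the margin between the two highest cosine similarities), the names `self_train`, `SEED`, and `WEB_SNIPPETS`, and the toy data. The Web download step is replaced here by an in-memory list of snippets.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroids(labeled):
    """Sum the term counts of all documents in each category."""
    cents = {}
    for text, label in labeled:
        cents.setdefault(label, Counter()).update(vectorize(text))
    return cents


def classify(text, cents):
    """Return (label, confidence), where confidence is the margin between
    the best and second-best category similarity (an assumed measure)."""
    vec = vectorize(text)
    scores = sorted(
        ((cosine(vec, c), label) for label, c in cents.items()), reverse=True
    )
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labeled, unlabeled, per_iteration=1, max_iterations=10):
    """Iteratively move the most confidently classified unlabeled examples
    into the training set, retraining after every iteration."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iterations):
        if not pool:
            break
        cents = centroids(labeled)
        predictions = []
        for text in pool:
            label, confidence = classify(text, cents)
            if confidence > 0:  # skip examples the classifier cannot separate
                predictions.append((confidence, text, label))
        if not predictions:
            break
        predictions.sort(reverse=True)
        # Only the best examples of this iteration join the training set.
        for confidence, text, label in predictions[:per_iteration]:
            labeled.append((text, label))
            pool.remove(text)
    return centroids(labeled), labeled


# Toy data (hypothetical): two manually labeled seeds and a few snippets
# standing in for unlabeled text downloaded from the Web.
SEED = [
    ("football match goal team", "sports"),
    ("election vote parliament government", "politics"),
]
WEB_SNIPPETS = [
    "the team scored a goal in the match",
    "government called an election vote",
    "keeper saved the goal",
]

if __name__ == "__main__":
    model, training_set = self_train(SEED, WEB_SNIPPETS)
    print(classify("keeper saved", model))
```

In this toy setting the seed examples alone share no vocabulary with the query "keeper saved", so the classifier cannot separate the categories. After the snippets have been added one at a time, the enlarged training set covers those terms. A real system would need a stronger classifier and a stopping criterion tuned on held-out data.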