Information is useful when it is available and usable at the moment we need it. Availability is easy to guarantee when the information is well structured, well ordered, and not too large. That situation is unusual, however: the amount of information on offer tends to grow enormously, to be unstructured, and to have no clear order. Structuring or ordering it manually is not viable because of the volume that has to be handled. Good information retrieval (IR) systems are therefore useful and even necessary. Another important characteristic is that information naturally appears in distributed form, so IR systems have to work in distributed environments and use parallelization methods.

This doctoral thesis addresses all of these aspects by developing and improving methods for building IR systems with better performance, in both retrieval quality and computational efficiency. In addition, these methods can work on systems that are distributed by nature.

The main objective of an IR system is to supply the documents that are relevant to a given query and to omit the irrelevant ones. IR systems face several important difficulties, notably polysemy, synonymy, and related words (two words that have one specific meaning when they appear together and a different meaning when they appear separately). All of these are dealt with in this doctoral thesis.

The development of an IR system has four basic phases: preprocessing, modeling, evaluation, and use. Preprocessing comprises the actions needed to transform a document collection into a data structure that holds the documents' relevant information. An important part of this thesis studies this phase, focusing on reducing the data and the structures while maximizing the information they contain. Modeling defines the structure and behaviour of the IR system, and it is the phase that has been analyzed and developed the most. This thesis works with the vector model and leaves out other models such as the probabilistic and Boolean ones. Evaluation determines the quality of the IR system. In this thesis we use already defined, widely used, and well tested methods, which are based directly or indirectly on precision and recall. The fourth phase is the use of the system, which this thesis does not address.
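As an illustration of the modeling and evaluation phases described above, the following minimal sketch ranks documents by cosine similarity in a vector model and then scores the result with precision and recall. The document identifiers, term weights, query, and relevance judgments are hypothetical values chosen for the example; they do not come from the thesis, and the thesis may weight and rank documents differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: three documents over a three-term vocabulary.
docs = {"d1": [2, 0, 1], "d2": [0, 1, 0], "d3": [1, 1, 0]}
query = [1, 0, 0]
relevant = {"d1", "d4"}  # assumed relevance judgments; d4 is not in the collection

top2 = rank(query, docs)[:2]
precision, recall = precision_recall(top2, relevant)
```

Here the two best-ranked documents contain one of the two relevant documents, so both precision and recall are 0.5.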

A very large number of clustering methods exist in many fields and for a wide variety of information systems, so we started from the two main methods in the literature: K-Means and DBSCAN. We then tried to improve their quality without losing their functionality or their computational performance. Specifically, we developed the a-Bisecting Spherical K-Means, a method that is less sensitive than K-Means to parameter initialization. We also developed the VDBSCAN method, which obtains the same clusters as DBSCAN almost twice as fast and eliminates the random selection of initialization parameters when not enough information about the IR system is available (it fixes the first parameter at a constant value and obtains the second with a heuristic developed in this doctoral thesis). All of these methods are intended to work in a distributed environment, so an important part of the thesis discusses parallelization aspects.
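To make the idea of a bisecting spherical clustering step concrete, here is a minimal sketch of a generic spherical 2-means split: documents are normalized to unit length, assigned to the centroid with the highest cosine similarity, and the centroids are recomputed and renormalized. This is not the a-Bisecting Spherical K-Means of the thesis, whose details are not given here. The deterministic seeding (the first document and the document least similar to it) and the function names are assumptions made only for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bisect_spherical(vectors, iterations=10):
    """Split vectors into two clusters by cosine similarity (spherical 2-means)."""
    docs = [normalize(v) for v in vectors]
    # Assumed seeding: the first document and the one least similar to it.
    first = docs[0]
    farthest = min(docs, key=lambda d: dot(d, first))
    centroids = [first, farthest]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment: on unit vectors the dot product is the cosine similarity.
        labels = [0 if dot(d, centroids[0]) >= dot(d, centroids[1]) else 1
                  for d in docs]
        # Update: mean of each cluster's members, projected back to unit length.
        updated = []
        for cluster in (0, 1):
            members = [d for d, label in zip(docs, labels) if label == cluster]
            if not members:
                updated.append(centroids[cluster])  # keep an empty cluster's centroid
                continue
            mean = [sum(column) / len(members) for column in zip(*members)]
            updated.append(normalize(mean))
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels, centroids


# Hypothetical two-term document vectors forming two obvious groups.
labels, centroids = bisect_spherical([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])
```

A bisecting scheme would apply such a split repeatedly, each time choosing one existing cluster to divide, until the desired number of clusters is reached.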

After the experimental study of retrieval quality and computational performance, we conclude that the VDBSCAN method obtains better quality than the a-Bisecting Spherical K-Means method. VDBSCAN has a more expensive modeling phase but behaves better under parallelization. With respect to evaluation time, a-Bisecting Spherical K-Means is always slightly faster than VDBSCAN, but the latter obtains better speedup and efficiency values. In conclusion, VDBSCAN should be selected whenever retrieval quality is the most important criterion, and a-Bisecting Spherical K-Means when the modeling phase is repeated many times, because it has a lower computational cost.
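The speedup and efficiency figures mentioned above are assumed here to follow the standard definitions: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of processors. The sketch below computes both. The timings are hypothetical and are not results from the thesis.

```python
def speedup_and_efficiency(t_sequential, t_parallel, processors):
    """Return (speedup, efficiency) for a run on the given number of processors."""
    if t_parallel <= 0 or processors <= 0:
        raise ValueError("parallel time and processor count must be positive")
    speedup = t_sequential / t_parallel
    efficiency = speedup / processors
    return speedup, efficiency


# Hypothetical timings in seconds: 120 s sequentially, 40 s on 4 processors.
speedup, efficiency = speedup_and_efficiency(120.0, 40.0, 4)
```

With these made-up timings the speedup is 3.0 and the efficiency is 0.75, meaning each processor does useful work about three quarters of the time.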