Abstract

In this Ph.D. thesis we investigate the problem of clustering a particular kind of document, namely narrow domain short texts. To this end, we have analysed datasets and clustering methods. Moreover, we have introduced corpus evaluation measures, term selection techniques and clustering validity measures in order to study the following problems:

- To determine the relative hardness of a corpus to be clustered and to study some of its features, such as shortness, domain broadness, stylometry, class imbalance and structure.
- To improve the state of the art of clustering narrow domain short-text corpora.

The research work we have carried out is partially focused on "short-text clustering". We consider this issue to be quite relevant, given the current and future ways in which people use "small-language" (e.g. blogs, snippets, news and text-message generation such as email or chat).

Moreover, we study the domain broadness of corpora. A corpus may be considered narrow domain if the vocabulary overlap among its documents is high, and wide domain if it is low. In the categorization task, it is very difficult to deal with narrow domain corpora such as scientific papers, technical reports, patents, etc.

The aim of this research work is to study possible strategies to tackle the following two problems: a) the low frequencies of vocabulary terms in short texts, and b) the high vocabulary overlap associated with narrow domains. Each problem alone is challenging enough; dealing with both at once, as in narrow domain short texts, increases the complexity of the problem significantly.

The clustering of scientific abstracts is even more difficult than the clustering of narrow domain short-text corpora in general. The reason is that texts belonging to scientific papers often make use of word sequences such as "in this paper we present", "the aim is", "the results", etc., which obviously increase the level of similarity among the short texts of a collection.
However, the correct selection of terms when clustering texts is very important, because the results may vary significantly.

We study scientific abstracts not only because of their particularly high complexity, but also because most digital libraries and other web-based repositories of scientific and technical information provide free access only to abstracts and not to the full texts of the documents. Due to the dynamic nature of research, new interests may arise in a field, and new sub-topics need to be discovered through clustering in order to be introduced later as new categories. Therefore, the clustering of abstracts becomes a real necessity.

In this thesis, we deal with the treatment of narrow domain short-text collections in three areas: evaluation, clustering and validation of corpora. The major contributions of the investigations carried out are:

- The study and introduction of evaluation measures to analyse the following features of a corpus: shortness, domain broadness, class imbalance, stylometry and structure.
- The development of the Watermarking Corpora On-line System, named WaCOS, for the assessment of corpus features.
- A new unsupervised methodology (which does not use any external knowledge resource) for dealing with narrow domain short-text corpora. This methodology suggests first applying self-term expansion and then term selection.

We analysed different corpus features as evidence of the relative hardness of a given corpus with respect to clustering algorithms. In particular, the degree of shortness, domain broadness, class imbalance, stylometry and structure were studied. We introduced both supervised and unsupervised measures in order to assess these features. The supervised measures were used both to evaluate the corpus features and, even more importantly, to assess the gold standard provided by experts for the corpus to be clustered.
The unsupervised measures evaluate the document collections directly (i.e., without any gold standard) and may therefore also be used for other purposes, for instance, to adjust clustering methods during execution in order to improve the results. The most successful measures were compiled in a freely accessible web-based system that allows researchers in linguistics and computational linguistics to easily assess the quality of corpora with respect to the aforementioned features.

The experiments conducted confirmed that the clustering of narrow domain short-text corpora is a very challenging task. However, the contributions of this research work show that it is possible to deal with this difficult problem and to improve on the results obtained with classical techniques and methods.
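
As a rough illustration of what an unsupervised corpus measure might look like, the notion of domain broadness used in this abstract (high vocabulary overlap for narrow domains, low for wide domains) could be approximated by the mean pairwise Jaccard overlap of document vocabularies. This is only a sketch under our own assumptions: the function names `tokenize` and `mean_vocabulary_overlap`, the whitespace tokenizer and the choice of the Jaccard coefficient are hypothetical and are not the measures defined in the thesis or implemented in WaCOS.

```python
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation.

    A deliberately naive tokenizer (an assumption for illustration only).
    Returns the set of distinct tokens, i.e. the document vocabulary.
    """
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return {tok for tok in tokens if tok}


def mean_vocabulary_overlap(documents):
    """Mean Jaccard overlap between the vocabularies of all document pairs.

    Values close to 1 suggest a narrow domain corpus, values close to 0 a
    wide domain one. No gold standard is needed, so the measure is
    unsupervised. Returns 0.0 when there are fewer than two documents.
    """
    vocabularies = [tokenize(doc) for doc in documents]
    pairs = list(combinations(vocabularies, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for first, second in pairs:
        union = first | second
        # Two empty documents share nothing; count their overlap as 0.
        total += len(first & second) / len(union) if union else 0.0
    return total / len(pairs)


if __name__ == "__main__":
    narrow = ["clustering short texts", "clustering short abstracts"]
    wide = ["clustering short texts", "football match tonight"]
    print(mean_vocabulary_overlap(narrow))
    print(mean_vocabulary_overlap(wide))
```

Because the computation compares all document pairs, its cost grows quadratically with corpus size; for large collections a sample of pairs would be a reasonable substitute.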