dc.contributor.author | Pinto, David | es_ES |
dc.contributor.author | Rosso, Paolo | es_ES |
dc.contributor.author | Jiménez-Salazar, Héctor | es_ES |
dc.date.accessioned | 2013-07-03T09:26:42Z | |
dc.date.issued | 2011 | |
dc.identifier.issn | 0010-4620 | |
dc.identifier.uri | http://hdl.handle.net/10251/30421 | |
dc.description.abstract | Clustering narrow domain short texts is considered to be a complex task because of the intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlapping associated to narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which may be used by computer scientists in order to evaluate the different features associated with a given textual corpus. | es_ES |
dc.description.sponsorship | This work was supported by MICINN project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i) and the CONACYT research project number 106625. | en_EN |
dc.language | English | es_ES |
dc.publisher | Oxford University Press (OUP): Policy A - Oxford Open Option A | es_ES |
dc.relation.ispartof | Computer Journal | es_ES |
dc.rights | All rights reserved | es_ES |
dc.subject | Clustering and analysis of textual data | es_ES |
dc.subject | Narrow domain short texts | es_ES |
dc.subject | Natural language processing | es_ES |
dc.subject | Internet tools | es_ES |
dc.subject.classification | COMPUTER LANGUAGES AND SYSTEMS | es_ES |
dc.title | A Self-Enriching Methodology for Clustering Narrow Domain Short Texts | es_ES |
dc.type | Article | es_ES |
dc.embargo.lift | 10000-01-01 | |
dc.embargo.terms | forever | es_ES |
dc.identifier.doi | 10.1093/comjnl/bxq069 | |
dc.relation.projectID | info:eu-repo/grantAgreement/MICINN//TIN2009-13391-C04-03/ES/Text-Enterprise 2.0: Tecnicas De Comprension De Textos Aplicadas A Las Necesidades De La Empresa 2.0/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/CONACYT//106625/ | es_ES |
dc.rights.accessRights | Closed | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | Pinto, D.; Rosso, P.; Jiménez-Salazar, H. (2011). A Self-Enriching Methodology for Clustering Narrow Domain Short Texts. Computer Journal. 54(7):1148-1165. https://doi.org/10.1093/comjnl/bxq069 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | http://dx.doi.org/10.1093/comjnl/bxq069 | es_ES |
dc.description.upvformatpinicio | 1148 | es_ES |
dc.description.upvformatpfin | 1165 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 54 | es_ES |
dc.description.issue | 7 | es_ES |
dc.relation.senia | 215390 | |
dc.identifier.eissn | 1460-2067 | |
dc.contributor.funder | Ministerio de Ciencia e Innovación | es_ES |
dc.contributor.funder | Consejo Nacional de Ciencia y Tecnología, México | es_ES |