A Self-Enriching Methodology for Clustering Narrow Domain Short Texts

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Pinto, David es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.contributor.author Jiménez-Salazar, Héctor es_ES
dc.date.accessioned 2013-07-03T09:26:42Z
dc.date.issued 2011
dc.identifier.issn 0010-4620
dc.identifier.uri http://hdl.handle.net/10251/30421
dc.description.abstract Clustering narrow domain short texts is considered to be a complex task because of the intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which may be used by computer scientists in order to evaluate the different features associated with a given textual corpus. es_ES
dc.description.sponsorship This work was supported by MICINN project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i) and the CONACYT research project number 106625. en_EN
dc.language English es_ES
dc.publisher Oxford University Press (OUP): Policy A - Oxford Open Option A es_ES
dc.relation.ispartof Computer Journal es_ES
dc.rights All rights reserved es_ES
dc.subject Clustering and analysis of textual data es_ES
dc.subject Narrow domain short texts es_ES
dc.subject Natural language processing es_ES
dc.subject Internet tools es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title A Self-Enriching Methodology for Clustering Narrow Domain Short Texts es_ES
dc.type Article es_ES
dc.embargo.lift 10000-01-01
dc.embargo.terms forever es_ES
dc.identifier.doi 10.1093/comjnl/bxq069
dc.relation.projectID info:eu-repo/grantAgreement/MICINN//TIN2009-13391-C04-03/ES/Text-Enterprise 2.0: Tecnicas De Comprension De Textos Aplicadas A Las Necesidades De La Empresa 2.0/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CONACYT//106625/ es_ES
dc.rights.accessRights Closed es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Pinto, D.; Rosso, P.; Jiménez-Salazar, H. (2011). A Self-Enriching Methodology for Clustering Narrow Domain Short Texts. Computer Journal. 54(7):1148-1165. https://doi.org/10.1093/comjnl/bxq069 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://dx.doi.org/10.1093/comjnl/bxq069 es_ES
dc.description.upvformatpinicio 1148 es_ES
dc.description.upvformatpfin 1165 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 54 es_ES
dc.description.issue 7 es_ES
dc.relation.senia 215390
dc.identifier.eissn 1460-2067
dc.contributor.funder Ministerio de Ciencia e Innovación es_ES
dc.contributor.funder Consejo Nacional de Ciencia y Tecnología, México es_ES

