- -

Silhouette + Attraction: A Simple and Effective Method for Text Clustering

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Compartir/Enviar a

Citas

Estadísticas

  • Estadisticas de Uso

Silhouette + Attraction: A Simple and Effective Method for Text Clustering

Mostrar el registro sencillo del ítem

Ficheros en el ítem

dc.contributor.author Errecalde, Marcelo es_ES
dc.contributor.author Cagnina, Leticia es_ES
dc.contributor.author Rosso, Paolo es_ES
dc.date.accessioned 2016-05-17T10:36:50Z
dc.date.available 2016-05-17T10:36:50Z
dc.date.issued 2015-08
dc.identifier.issn 1351-3249
dc.identifier.uri http://hdl.handle.net/10251/64228
dc.description.abstract [EN] This article presents silhouette attraction (Sil Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles allows us to obtain a general technique that can be used either as a boosting method, which improves results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil Att is able to obtain high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora is indicative that it is a very robust method. This is a very interesting positive aspect of Sil Att with respect to the other algorithms used in the experiments, whose performances heavily depend on specific characteristics of the corpora being considered. es_ES
dc.description.sponsorship This research work has been partially funded by UNSL, CONICET (Argentina), DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project, and the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework on Web Information Quality Evaluation Initiative. The work of the third author was done also in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. en_EN
dc.language Inglés es_ES
dc.publisher Cambridge University Press (CUP): STM Journals es_ES
dc.relation.ispartof Natural Language Engineering es_ES
dc.rights Reserva de todos los derechos es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Silhouette + Attraction: A Simple and Effective Method for Text Clustering es_ES
dc.type Artículo es_ES
dc.identifier.doi 10.1017/S1351324915000273
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/269180/EU/Web Information Quality Evaluation Initiative/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//TIN2012-38603-C02-01/ES/DIANA-APPLICATIONS: FINDING HIDDEN KNOWLEDGE IN TEXTS: APPLICATIONS/ es_ES
dc.rights.accessRights Abierto es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Errecalde, M.; Cagnina, L.; Rosso, P. (2015). Silhouette + Attraction: A Simple and Effective Method for Text Clustering. Natural Language Engineering. 1-40. https://doi.org/10.1017/S1351324915000273 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://dx.doi.org/10.1017/S1351324915000273 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 40 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.relation.senia 306218 es_ES
dc.contributor.funder European Commission
dc.contributor.funder Ministerio de Economía y Competitividad es_ES
dc.description.references Zhao, Y., & Karypis, G. (2004). Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Machine Learning, 55(3), 311-331. doi:10.1023/b:mach.0000027785.44527.d6 es_ES
dc.description.references Tu, L., & Chen, Y. (2009). Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-27. doi:10.1145/1552303.1552305 es_ES
dc.description.references Yang, T., Jin, R., Chi, Y., & Zhu, S. (2009). Combining link and content for community detection. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’09. doi:10.1145/1557019.1557120 es_ES
dc.description.references Zhao, Y., Karypis, G., & Fayyad, U. (2005). Hierarchical Clustering Algorithms for Document Datasets. Data Mining and Knowledge Discovery, 10(2), 141-168. doi:10.1007/s10618-005-0361-3 es_ES
dc.description.references Kaufman, L., & Rousseeuw, P. J. (Eds.). (1990). Finding Groups in Data. Wiley Series in Probability and Statistics. doi:10.1002/9780470316801 es_ES
dc.description.references Karypis, G., Eui-Hong Han, & Kumar, V. (1999). Chameleon: hierarchical clustering using dynamic modeling. Computer, 32(8), 68-75. doi:10.1109/2.781637 es_ES
dc.description.references Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm Optimization approach to cluster short texts. Information Sciences, 265, 36-49. doi:10.1016/j.ins.2013.12.010 es_ES
dc.description.references He, H., Chen, B., Xu, W., & Guo, J. (2007). Short Text Feature Extraction and Clustering for Web Topic Mining. Third International Conference on Semantics, Knowledge and Grid (SKG 2007). doi:10.1109/skg.2007.76 es_ES
dc.description.references Spearman, C. (1904). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1), 72. doi:10.2307/1412159 es_ES
dc.description.references Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. doi:10.1016/0377-0427(87)90125-7 es_ES
dc.description.references Manning, C. D., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval. doi:10.1017/cbo9780511809071 es_ES
dc.description.references Qi, G.-J., Aggarwal, C. C., & Huang, T. (2012). Community Detection with Edge Content in Social Media Networks. 2012 IEEE 28th International Conference on Data Engineering. doi:10.1109/icde.2012.77 es_ES
dc.description.references Daxin Jiang, Jian Pei, & Aidong Zhang. (s. f.). DHC: a density-based hierarchical clustering method for time series gene expression data. Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings. doi:10.1109/bibe.2003.1188978 es_ES
dc.description.references Charikar, M., Chekuri, C., Feder, T., & Motwani, R. (2004). Incremental Clustering and Dynamic Information Retrieval. SIAM Journal on Computing, 33(6), 1417-1440. doi:10.1137/s0097539702418498 es_ES
dc.description.references Selim, S. Z., & Alsultan, K. (1991). A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24(10), 1003-1008. doi:10.1016/0031-3203(91)90097-o es_ES
dc.description.references Aranganayagi, S., & Thangavel, K. (2007). Clustering Categorical Data Using Silhouette Coefficient as a Relocating Measure. International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007). doi:10.1109/iccima.2007.328 es_ES
dc.description.references Makagonov, P., Alexandrov, M., & Gelbukh, A. (2004). Clustering Abstracts Instead of Full Texts. Lecture Notes in Computer Science, 129-135. doi:10.1007/978-3-540-30120-2_17 es_ES
dc.description.references Jing L. 2005. Survey of text clustering. Technical report. Department of Mathematics. The University of Hong Kong, Hong Kong, China. es_ES
dc.description.references Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. doi:10.1002/j.1538-7305.1948.tb01338.x es_ES
dc.description.references Hearst, M. A. (2006). Clustering versus faceted categories for information exploration. Communications of the ACM, 49(4), 59. doi:10.1145/1121949.1121983 es_ES
dc.description.references Alexandrov, M., Gelbukh, A., & Rosso, P. (2005). An Approach to Clustering Abstracts. Lecture Notes in Computer Science, 275-285. doi:10.1007/11428817_25 es_ES
dc.description.references Dos Santos, J. B., Heuser, C. A., Moreira, V. P., & Wives, L. K. (2011). Automatic threshold estimation for data matching applications. Information Sciences, 181(13), 2685-2699. doi:10.1016/j.ins.2010.05.029 es_ES
dc.description.references Hasan, M. A., Chaoji, V., Salem, S., & Zaki, M. J. (2009). Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11), 994-1002. doi:10.1016/j.patrec.2009.04.013 es_ES
dc.description.references Dunn†, J. C. (1974). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, 4(1), 95-104. doi:10.1080/01969727408546059 es_ES
dc.description.references Carullo, M., Binaghi, E., & Gallo, I. (2009). An online document clustering technique for short web contents. Pattern Recognition Letters, 30(10), 870-876. doi:10.1016/j.patrec.2009.04.001 es_ES
dc.description.references Kruskal, W. H., & Wallis, W. A. (1952). Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association, 47(260), 583-621. doi:10.1080/01621459.1952.10483441 es_ES
dc.description.references Bezdek, J. C., & Pal, N. R. (s. f.). Cluster validation with generalized Dunn’s indices. Proceedings 1995 Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. doi:10.1109/annes.1995.499469 es_ES
dc.description.references Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., & Dougherty, E. R. (2007). Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3), 807-824. doi:10.1016/j.patcog.2006.06.026 es_ES
dc.description.references Davies, D. L., & Bouldin, D. W. (1979). A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224-227. doi:10.1109/tpami.1979.4766909 es_ES
dc.description.references Pinto, D., & Rosso, P. (s. f.). On the Relative Hardness of Clustering Corpora. Lecture Notes in Computer Science, 155-161. doi:10.1007/978-3-540-74628-7_22 es_ES
dc.description.references Pons-Porrata, A., Berlanga-Llavori, R., & Ruiz-Shulcloper, J. (2007). Topic discovery based on text mining techniques. Information Processing & Management, 43(3), 752-768. doi:10.1016/j.ipm.2006.06.001 es_ES
dc.description.references Pinto, D., Benedí, J.-M., & Rosso, P. (2007). Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. Lecture Notes in Computer Science, 611-622. doi:10.1007/978-3-540-70939-8_54 es_ES


Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem