Clustering Narrow-Domain Short Texts Using K-Means, Linguistic Patterns and LSI

In this work we consider the problem of clustering short texts from a narrow domain, such as academic abstracts. Our main objective is to check whether the quality of k-means clustering can be improved by expanding the feature space with a dictionary of word groups selected from the texts on the basis of a fixed set of patterns. We also check whether clustering quality can be increased by mapping the feature space to a lower-dimensional semantic space using Latent Semantic Indexing (LSI). The results suggest that both modifications are feasible in practice compared with running k-means in the feature space defined only by the main dictionary of the corpus.
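
The feature-space expansion described above can be illustrated with a minimal sketch. The abstract does not list the actual patterns, so the rule used here (any two adjacent non-stopword tokens form a word group) is only a hypothetical stand-in, as are the stopword list, the function names, and the two sample texts. The sketch builds a term-count matrix over the main dictionary and over the expanded dictionary; k-means, optionally preceded by an LSI projection, would then run on the rows of such a matrix.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_groups(tokens):
    """Hypothetical pattern: two adjacent tokens, neither of them a stopword."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])
            if a not in STOPWORDS and b not in STOPWORDS]


def build_features(texts, use_groups=True):
    """Return (vocabulary, count matrix); word groups are added when requested."""
    docs = []
    for text in texts:
        tokens = tokenize(text)
        terms = [t for t in tokens if t not in STOPWORDS]  # main dictionary
        if use_groups:
            terms += word_groups(tokens)                   # expanded dictionary
        docs.append(Counter(terms))
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


# Made-up sample texts, for illustration only.
texts = [
    "Clustering of short texts with k-means",
    "Latent semantic indexing for short texts",
]
base_vocab, _ = build_features(texts, use_groups=False)
full_vocab, matrix = build_features(texts, use_groups=True)
print(len(base_vocab), len(full_vocab))
print([term for term in full_vocab if "_" in term])
```

In this toy setting the expanded vocabulary gains features such as `short_texts`, which two abstracts can share even when their remaining single words differ. Whether this helps on a real corpus depends on the chosen patterns and is exactly what the work sets out to check.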
