Local theme detection and annotation with keywords for narrow and wide domain short text collections

This paper presents an approach to clustering text collections and automatically detecting the topic and keywords of each cluster. The research focuses on narrow-domain short texts, such as short news items and abstracts of scientific papers. We propose a term selection method that significantly improves the quality of hierarchical clustering, together with an automatic algorithm that annotates clusters with keywords and topic names. The clustering results compare favorably with those of other approaches, and the algorithm also extracts keywords for each cluster using the size of the cluster and the word frequencies in its documents.

Keywords: narrow domain short text clustering; automatic annotation; hierarchical clustering; Pearson correlation.
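The abstract does not specify the term selection method, the linkage criterion, or the keyword scoring rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes whitespace tokenization, raw term counts, average-linkage agglomerative clustering with Pearson correlation as the similarity, and a hypothetical `min_share` threshold that keeps a term as a keyword when it occurs in at least that fraction of the documents in a cluster. The function names `cluster_texts` and `keywords` are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_texts(docs, k):
    """Agglomerative clustering into k clusters (average linkage, assumed)."""
    tokens = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    counts = [Counter(doc) for doc in tokens]
    vectors = [[c[t] for t in vocab] for c in counts]
    clusters = [[i] for i in range(len(docs))]

    def similarity(a, b):
        # Mean pairwise Pearson correlation between two clusters.
        total = sum(pearson(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the most similar pair; a < b, so popping b keeps a valid.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] += clusters.pop(b)
    return clusters, tokens


def keywords(cluster, tokens, min_share=0.6):
    """Terms found in at least min_share of the cluster's documents."""
    doc_freq = Counter(t for i in cluster for t in set(tokens[i]))
    return sorted(t for t, c in doc_freq.items() if c >= min_share * len(cluster))


if __name__ == "__main__":
    docs = [
        "stock market falls sharply",
        "stock market rallies again",
        "team wins football match",
        "team loses football match",
    ]
    clusters, tokens = cluster_texts(docs, k=2)
    for cluster in clusters:
        print(cluster, keywords(cluster, tokens))
```

The keyword threshold is scaled by cluster size, which mirrors the abstract's remark that annotation uses both the size of a cluster and word frequencies in documents; the actual weighting used in the paper may differ.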
