Topic Models and Fusion Methods: a Union to Improve Text Clustering and Cluster Labeling

Topic modeling algorithms are statistical methods that aim to discover the topics running through text documents. Topic models are popular in machine learning and text mining because they can infer the latent topic structure of a corpus. In this paper, we present a document enrichment approach that uses state-of-the-art topic models and data fusion methods to enrich the documents of a collection, with the aim of improving the quality of text clustering and cluster labeling. We propose a bi-vector space model in which every document of the corpus is represented by two vectors: one generated by the fusion-based topic modeling approach, and one that is simply the traditional vector space representation. Our experiments on various datasets show that combining topic modeling and fusion methods to create document vectors can significantly improve the quality of document clustering results.
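To make the bi-vector idea concrete, the following is a minimal sketch rather than the method evaluated in the paper. It assumes each document is stored as a pair (term vector, topic vector) and that two documents are compared by a weighted sum of the cosine similarities of the two views. The weight `alpha`, the function names, and the toy vectors are all hypothetical; the abstract does not specify how the two vectors are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bi_vector_similarity(doc_a, doc_b, alpha=0.5):
    """Compare two documents, each given as (term_vector, topic_vector).

    Assumption: the two views are merged by a linear combination of
    per-view cosine similarities; alpha weights the term-based view.
    """
    term_a, topic_a = doc_a
    term_b, topic_b = doc_b
    return alpha * cosine(term_a, term_b) + (1 - alpha) * cosine(topic_a, topic_b)


# Toy documents (hypothetical values): term weights, then topic proportions.
doc1 = ([1.0, 0.0, 2.0], [0.9, 0.1])
doc2 = ([0.0, 1.0, 0.0], [0.8, 0.2])  # no shared terms with doc1, similar topics
doc3 = ([1.0, 0.0, 2.0], [0.1, 0.9])  # same terms as doc1, different topics

print(bi_vector_similarity(doc1, doc2))
print(bi_vector_similarity(doc1, doc3))
```

In this toy setup, `doc1` and `doc2` share no terms, so a purely term-based model would treat them as unrelated, while the topic view still assigns them a high similarity. A clustering algorithm that accepts a custom similarity function could use `bi_vector_similarity` in place of plain cosine similarity.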
