Improving Topic Coherence Using Parsimonious Language Model and Latent Semantic Indexing

Topic models provide statistical modeling of unstructured text data. A critical requirement for an effective topic model is the removal of unrelated words, whose spurious co-occurrence can otherwise infiltrate the topics. To obtain better-quality topics for a document collection, pre-processing to remove stop words is an essential step. However, beyond stop words, many non-informative words remain in the corpus and degrade the quality of the topic models. To build a more effective topic model, this paper proposes a strategy for producing more coherent topics by using the Parsimonious Language Model as a pre-processing framework. The Parsimonious Language Model retains document-specific terms and expels general words. To evaluate the performance of the resulting topic models, topic coherence is computed by estimating the distance between word vectors. Experiments are performed on the 20 Newsgroups dataset, and the proposed methodology is observed to generate more coherent topics than the baseline models.
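The sketch below illustrates the pipeline the abstract describes: parsimonious pre-processing of the 20 Newsgroups documents, followed by topic modeling and a coherence score. It is a minimal illustration under stated assumptions, not the authors' implementation; the use of gensim and scikit-learn, the LSI model with 20 topics, the mixing weight lam = 0.1, the pruning rule, and the c_v coherence measure are all choices made for the example.

import re
from collections import Counter

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from gensim.corpora import Dictionary
from gensim.models import LsiModel, CoherenceModel


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of length >= 3."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= 3]


def parsimonious_weights(doc_tf, corpus_prob, lam=0.1, iters=20):
    """EM estimate of the parsimonious document model P(t|D).

    doc_tf      -- Counter of term frequencies in one document
    corpus_prob -- dict mapping term -> P(t|C), the background corpus model
    lam         -- mixing weight of the document model against the background
    """
    terms = list(doc_tf)
    tf = np.array([doc_tf[t] for t in terms], dtype=float)
    bg = np.array([corpus_prob[t] for t in terms])
    p = tf / tf.sum()                                    # start from the MLE
    for _ in range(iters):
        e = tf * (lam * p) / (lam * p + (1 - lam) * bg)  # E-step
        p = e / e.sum()                                  # M-step
    return dict(zip(terms, p))


# 1. Load and tokenize the 20 Newsgroups training documents.
raw = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
docs = [toks for toks in (tokenize(d) for d in raw.data) if toks]

# 2. Background model P(t|C) from corpus-wide term frequencies.
corpus_tf = Counter(t for d in docs for t in d)
total = sum(corpus_tf.values())
corpus_prob = {t: c / total for t, c in corpus_tf.items()}

# 3. Parsimonious pruning: keep only terms to which the document model
#    assigns more probability than the background model does (an
#    illustrative threshold, not the paper's setting).
pruned_docs = []
for d in docs:
    weights = parsimonious_weights(Counter(d), corpus_prob)
    keep = {t for t, p in weights.items() if p > corpus_prob[t]}
    pruned = [t for t in d if t in keep]
    if pruned:
        pruned_docs.append(pruned)

# 4. Latent Semantic Indexing on the pruned corpus.
dictionary = Dictionary(pruned_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)
bow = [dictionary.doc2bow(d) for d in pruned_docs]
lsi = LsiModel(bow, id2word=dictionary, num_topics=20)

# 5. Topic coherence (c_v, based on similarities between word context
#    vectors) over the top words of each LSI topic.
top_words = [[word for word, _ in lsi.show_topic(t, topn=10)]
             for t in range(lsi.num_topics)]
coherence = CoherenceModel(topics=top_words, texts=pruned_docs,
                           dictionary=dictionary, coherence="c_v").get_coherence()
print(f"c_v coherence of the parsimonious LSI model: {coherence:.4f}")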
