Tackling topic general words in topic modeling

Topic models are a widely used tool for exploring latent topics in documents and for supporting many NLP tasks. To obtain good topics for a corpus, a preprocessing step is often needed to remove common stop words and to identify topic general words (TGWs) in the corpus. Such words can seriously harm topic formation because they create spurious co-occurrences among unrelated words. They are also likely to occupy the top positions of multiple topics and to cause many unrelated words to be grouped under one topic, which results in topics that are inscrutable and similar to one another. In practice, TGWs are typically identified and removed from the corpus by hand. This is a time-consuming process and is very hard for a lay user. In this paper, we aim to solve this problem automatically. The proposed approaches can be based on the current corpus alone or on multiple corpora. In the latter case, a novel continuous learning method is proposed that learns from the past results of multiple domain corpora to help identify TGWs in the current domain. We conduct experiments on two real-world datasets, and the experimental results show that the proposed approaches achieve superior results.

Highlights

Study the problem of topic general words in topic modeling.
Propose a metric, the generality score, to measure the generality of a word.
Propose a new topic model, generality-sensitive LDA, to exploit generality scores in modeling.
Propose a continuous learning approach that can use multiple domains to find topic general words.
