Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling

Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) was recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot appropriately represent the different meanings of multiword expressions that share a word. We therefore propose a method for extracting and selecting collocations as a preprocessing step for HLTA. Each selected collocation is replaced with a single token in the bag-of-words model before HLTA is run. Our empirical evaluation shows that the proposed method improved the performance of HLTA on three of the four data sets tested.
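
The preprocessing idea described above can be illustrated with a minimal sketch. The abstract does not state how collocations are extracted or selected, so the criterion used here (adjacent word pairs filtered by a frequency threshold and pointwise mutual information) is an assumption made purely for illustration, as are the function names, the thresholds `min_count` and `min_pmi`, and the underscore used to join words into one token.

```python
import math
from collections import Counter


def select_collocations(docs, min_count=2, min_pmi=1.0):
    """Select adjacent word pairs as candidate collocations.

    Assumption: frequency plus PMI filtering stands in for the paper's
    selection step, which the abstract does not specify.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    selected = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log(p_ab / (p_a * p_b)) >= min_pmi:
            selected.add((a, b))
    return selected


def merge_collocations(tokens, selected):
    """Replace each selected pair with a single token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bag_of_words(tokens):
    """Count tokens; merged collocations are counted as single words."""
    return Counter(tokens)


# Hypothetical toy corpus: "network" appears in two different expressions.
docs = [
    ["neural", "network", "model"],
    ["social", "network", "analysis"],
    ["neural", "network", "training"],
    ["social", "network", "graph"],
]
selected = select_collocations(docs)
bags = [bag_of_words(merge_collocations(d, selected)) for d in docs]
```

In this toy corpus, the two expressions that share the word "network" become separate tokens, so a tree-structured model run on the resulting bags of words would no longer have to attach a single "network" variable to one branch only.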
