Scalable Topical Phrase Mining from Text Corpora

While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post processing to the results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.

[1]  Vincent Ng,et al.  Conundrums in Unsupervised Keyphrase Extraction: Making Sense of the State-of-the-Art , 2010, COLING.

[2]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[3]  John D. Lafferty,et al.  Visualizing Topics with Multi-Word Expressions , 2009, 0907.1013.

[4]  Michal Rosen-Zvi,et al.  Hidden Topic Markov Models , 2007, AISTATS.

[5]  S. Cooper Chapter 6. , 1887, Interviews with Rudolph A Marcus on Electron Transfer Reactions.

[6]  Yang Song,et al.  Topical Keyphrase Extraction from Twitter , 2011, ACL.

[7]  A. McCallum,et al.  Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[8]  Jiawei Han,et al.  Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents , 2014, SDM.

[9]  T. Minka Estimating a Dirichlet distribution , 2012 .

[10]  Chong Wang,et al.  Reading Tea Leaves: How Humans Interpret Topic Models , 2009, NIPS.

[11]  Daniel Jurafsky,et al.  Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem? , 2001, EMNLP.

[12]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[13]  Alice H. Oh,et al.  Aspect and sentiment unification model for online review analysis , 2011, WSDM '11.

[14]  ChengXiang Zhai,et al.  Automatic labeling of multinomial topic models , 2007, KDD '07.

[15]  Ted Pedersen,et al.  Fishing for Exactness , 1996, ArXiv.

[16]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[17]  Zhi Geng,et al.  Structural Learning of Chain Graphs via Decomposition. , 2008, Journal of machine learning research : JMLR.

[18]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[19]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[20]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[21]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[22]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[23]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[24]  Yue Lu,et al.  Enriching text representation with frequent pattern mining for probabilistic topic modeling , 2012, ASIST.

[25]  Timothy Baldwin,et al.  Automatic Labelling of Topic Models , 2011, ACL.

[26]  Zhiyuan Liu,et al.  Clustering to Find Exemplar Terms for Keyphrase Extraction , 2009, EMNLP.

[27]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[28]  Robert V. Lindsey,et al.  A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes , 2012, EMNLP.

[29]  Kenneth Ward Church,et al.  Using Statistics in Lexical Analysis , 2003, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon.