An Efficient Method for High Quality and Cohesive Topical Phrase Mining

A phrase is a natural, meaningful, and essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. However, existing approaches still leave room for improvement in both phrase quality and topical cohesion. Topical phrase mining typically proceeds in two stages: phrase mining and topic modeling. In the phrase mining stage, existing approaches often suffer from order sensitivity and inappropriate segmentation, which cause them to extract phrases of inferior quality. In the topic modeling stage, traditional topic models do not fully account for the constraints induced by phrases, which can weaken topical cohesion. Moreover, existing approaches often miss domain terminology because they neglect the domain-level topical distribution. In this paper, we propose an efficient method for mining high-quality, cohesive topical phrases. A high-quality phrase should satisfy the criteria of frequency, phraseness, completeness, and appropriateness. Our framework integrates a quality-guaranteed phrase mining method, a novel topic model that incorporates phrase constraints, and a novel document clustering method into an iterative loop that improves both phrase quality and topical cohesion. We also present algorithmic designs that allow these methods to be executed efficiently. Empirical evaluation demonstrates that our method outperforms state-of-the-art methods in terms of both interpretability and efficiency.
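To make the frequency and phraseness criteria concrete, the sketch below shows a generic bigram filter that keeps candidates which are frequent and have high pointwise mutual information, a common proxy for phraseness. This is an illustrative baseline, not the paper's quality-scoring procedure; the function name `candidate_bigrams`, the thresholds `min_count` and `min_pmi`, and the PMI-based scoring are all assumptions introduced here, and the completeness and appropriateness criteria would require additional segmentation-level checks that this sketch omits.

```python
from collections import Counter
from math import log

def candidate_bigrams(docs, min_count=5, min_pmi=3.0):
    """Illustrative phrase-candidate filter (not the paper's method).

    Keeps bigrams that are frequent enough (frequency criterion) and whose
    pointwise mutual information is high (a simple phraseness proxy).
    """
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        total += len(tokens)

    results = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # PMI = log P(w1, w2) / (P(w1) * P(w2)); the token total is used as a
        # rough normalizer for both unigram and bigram probabilities.
        p_joint = count / total
        p_w1 = unigrams[w1] / total
        p_w2 = unigrams[w2] / total
        pmi = log(p_joint / (p_w1 * p_w2))
        if pmi >= min_pmi:
            results.append((f"{w1} {w2}", count, pmi))

    # Highest-PMI candidates first, so collocations such as "topic model"
    # surface ahead of incidental word pairs.
    return sorted(results, key=lambda item: -item[2])
```

In the iterative framework described above, a filter of this kind would correspond only to the phrase mining step; the mined phrases would then constrain the topic model, and the resulting topic assignments would drive document clustering before the next iteration.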
