A phrase mining framework for recursive construction of a topical hierarchy

A high quality hierarchical organization of the concepts in a dataset at different levels of granularity has many valuable applications such as search, summarization, and content browsing. In this paper we propose an algorithm for recursively constructing a hierarchy of topics from a collection of content-representative documents. We characterize each topic in the hierarchy by an integrated ranked list of mixed-length phrases. Our mining framework is based on a phrase-centric view for clustering, extracting, and ranking topical phrases. Experiments with datasets from three different domains illustrate our ability to generate hierarchies of high quality topics represented by meaningful phrases.

[1]  Matthew Hurst,et al.  A Language Model Approach to Keyphrase Extraction , 2003, ACL 2003.

[2]  Jiawei Han,et al.  KERT: Automatic Extraction and Ranking of Topical Keyphrases from Content-Representative Document Titles , 2013, ArXiv.

[3]  Jian Pei,et al.  Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[4]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[5]  A. McCallum,et al.  Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[6]  Robert V. Lindsey,et al.  A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes , 2012, EMNLP.

[7]  Luigi Di Caro,et al.  Using tagflake for condensing navigable tag hierarchies from tag clouds , 2008, KDD.

[8]  George A. Vouros,et al.  Discovering Subsumption Hierarchies of Ontology Concepts from Text Corpora , 2007, IEEE/WIC/ACM International Conference on Web Intelligence (WI'07).

[9]  Hongfei Yan,et al.  Automatic labeling hierarchical topics , 2012, CIKM '12.

[10]  ChengXiang Zhai,et al.  Automatic labeling of multinomial topic models , 2007, KDD '07.

[11]  Dragomir R. Radev,et al.  Book Review: Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev , 2011, CL.

[12]  Daniel Jurafsky,et al.  Learning Syntactic Patterns for Automatic Hypernym Discovery , 2004, NIPS.

[13]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[14]  Chong Wang,et al.  Reading Tea Leaves: How Humans Interpret Topic Models , 2009, NIPS.

[15]  Wei Li,et al.  Mixtures of hierarchical topics with Pachinko allocation , 2007, ICML '07.

[16]  Ken Barker,et al.  Using Noun Phrase Heads to Extract Document Keyphrases , 2000, Canadian Conference on AI.

[17]  Eric P. Xing,et al.  Sparse Additive Generative Models of Text , 2011, ICML.

[18]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[19]  Alexander Pretschner,et al.  Ontology-based personalized search and browsing , 2003, Web Intell. Agent Syst..

[20]  Maria P. Grineva,et al.  Extracting key terms from noisy and multitheme documents , 2009, WWW '09.

[21]  Thomas L. Griffiths,et al.  Hierarchical Topic Models and the Nested Chinese Restaurant Process , 2003, NIPS.

[22]  Haixun Wang,et al.  Automatic taxonomy construction from keywords , 2012, KDD.

[23]  John D. Lafferty,et al.  Visualizing Topics with Multi-Word Expressions , 2009, 0907.1013.

[24]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[25]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[26]  Min-Yen Kan,et al.  Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles , 2009, MWE@IJCNLP.

[27]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[28]  Shui-Lung Chuang,et al.  A practical web-based approach to generating topic hierarchy for text segments , 2004, CIKM '04.

[29]  Mark E. J. Newman,et al.  An efficient and principled method for detecting communities in networks , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[30]  Mohammed Bennamoun,et al.  Ontology learning from text: A look back and into the future , 2012, CSUR.

[31]  Yang Song,et al.  Topical Keyphrase Extraction from Twitter , 2011, ACL.

[32]  Stefano Faralli,et al.  A Graph-Based Algorithm for Inducing Lexical Taxonomies from Scratch , 2011, IJCAI.

[33]  Zhiyuan Liu,et al.  Automatic Keyphrase Extraction via Topic Decomposition , 2010, EMNLP.

[34]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[35]  Yizhou Sun,et al.  ETM: Entity Topic Models for Mining Documents Associated with Entities , 2012, 2012 IEEE 12th International Conference on Data Mining.

[36]  W. Bruce Croft,et al.  Discovering and Comparing Topic Hierarchies , 2000, RIAO.