Unsupervised Concept Hierarchy Learning: A Topic Modeling Guided Approach

Abstract This paper proposes an efficient and scalable method, guided by a topic modeling process, for concept extraction and concept hierarchy learning from large unstructured text corpora. The method derives “concepts” from statistically discovered “topics” and then learns a hierarchy of those concepts by exploiting a subsumption relation between them. An advantage of the proposed method is that the entire process falls under the unsupervised learning paradigm, so no domain-specific training corpus is needed. Given a massive collection of text documents, the method maps topics to concepts using lightweight statistical and linguistic processing and then probabilistically learns the subsumption hierarchy. Extensive experiments with large text corpora such as the BBC News dataset and the Reuters News corpus show that the proposed method outperforms some existing methods for concept extraction, and that efficient concept hierarchy learning is possible when the overall task is guided by a topic modeling process.
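The abstract does not spell out how the subsumption hierarchy is learned, so the following is only an illustrative sketch of one well-known probabilistic subsumption test based on document co-occurrence (Sanderson and Croft, 1999): concept x subsumes concept y when P(x | y) >= 0.8 and P(y | x) < 1. The function name `subsumption_pairs`, the 0.8 threshold default, and the toy documents are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import permutations


def subsumption_pairs(doc_concepts, threshold=0.8):
    """Return sorted (parent, child) pairs where parent subsumes child.

    doc_concepts: list of sets, the concepts found in each document.
    A parent subsumes a child when P(parent | child) >= threshold and
    P(child | parent) < 1, both estimated from document co-occurrence.
    This is an assumed, illustrative rule, not the paper's exact method.
    """
    doc_freq = defaultdict(int)  # concept -> number of documents containing it
    co_freq = defaultdict(int)   # (a, b) -> number of documents containing both
    for concepts in doc_concepts:
        for c in concepts:
            doc_freq[c] += 1
        for a, b in permutations(concepts, 2):
            co_freq[(a, b)] += 1

    pairs = []
    for (parent, child), both in co_freq.items():
        p_parent_given_child = both / doc_freq[child]
        p_child_given_parent = both / doc_freq[parent]
        if p_parent_given_child >= threshold and p_child_given_parent < 1.0:
            pairs.append((parent, child))
    return sorted(pairs)


# Toy corpus (hypothetical): each set holds the concepts found in one document.
docs = [
    {"sport", "football"},
    {"sport", "football", "goal"},
    {"sport", "tennis"},
    {"sport"},
]
hierarchy = subsumption_pairs(docs)
for parent, child in hierarchy:
    print(f"{parent} -> {child}")
```

On this toy input the sketch also emits the transitive pair (sport, goal) alongside (sport, football) and (football, goal). Turning the pairs into a tree would need an extra pruning step, which is omitted here.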
