TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering

Taxonomy construction is not only a fundamental task for semantic analysis of text corpora, but also an important step for applications such as information filtering, recommendation, and Web search. Existing pattern-based methods extract hypernym-hyponym term pairs and then organize these pairs into a taxonomy. However, because they treat each term as an independent concept node, they overlook the topical proximity and semantic correlations among terms. In this paper, we propose a method for constructing topic taxonomies, in which every node represents a conceptual topic and is defined as a cluster of semantically coherent concept terms. Our method, TaxoGen, uses term embeddings and hierarchical clustering to construct a topic taxonomy recursively. To ensure the quality of the recursive process, TaxoGen comprises two modules: (1) an adaptive spherical clustering module that allocates terms to the proper levels when splitting a coarse topic into fine-grained ones; and (2) a local embedding module that learns term embeddings which maintain strong discriminative power at different levels of the taxonomy. Our experiments on two real datasets demonstrate the effectiveness of TaxoGen compared with baseline methods.
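
To make the clustering step concrete, the following is a minimal sketch of plain spherical k-means, which groups unit-normalized term embeddings by cosine similarity. It is an illustration under stated assumptions, not TaxoGen's implementation: the function names, the toy two-dimensional embeddings, and the parameters `iters` and `seed` are all hypothetical, and the adaptive reallocation of terms across levels and the local embedding module are not shown.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iters=20, seed=0):
    """Cluster vectors on the unit sphere by cosine similarity.

    Hypothetical helper for illustration only; returns (labels, centers).
    """
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    # Initialize centers from k distinct data points.
    centers = [points[i] for i in rng.sample(range(len(points)), k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the center with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centers[c])) for p in points]
        # Recompute each center as the normalized sum of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centers


# Toy embeddings (assumed values): two terms per underlying topic.
terms = ["neural network", "deep learning", "query processing", "database index"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, _ = spherical_kmeans(embeddings, k=2)
for term, label in zip(terms, labels):
    print(label, term)
```

In a recursive construction, each resulting cluster would become a child topic, and the same procedure would be applied again to the terms assigned to it.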
