Learning a Concept Hierarchy from Multi-labeled Documents

While topic models can discover patterns of word usage in large corpora, it is difficult to meld this unsupervised structure with noisy, human-provided labels, especially when the label space is large. In this paper, we present a model—Label to Hierarchy (L2H)—that can induce a hierarchy of user-generated labels and the topics associated with those labels from a set of multi-labeled documents. The model is robust enough to account for missing labels from untrained, disparate annotators and provide an interpretable summary of an otherwise unwieldy label set. We show empirically the effectiveness of L2H in predicting held-out words and labels for unseen documents.
