Graph-Sparse LDA: A Topic Model with Structured Sparsity

Topic modeling is a powerful tool for uncovering latent structure in many domains, including medicine, finance, and vision. The goals for the model vary depending on the application: sometimes the discovered topics are used for prediction or another downstream task. In other cases, the content of the topic may be of intrinsic scientific interest. Unfortunately, even when one uses modern sparse techniques, discovered topics are often difficult to interpret due to the high dimensionality of the underlying space. To improve topic interpretability, we introduce Graph-Sparse LDA, a hierarchical topic model that uses knowledge of relationships between words (e.g., as encoded by an ontology). In our model, topics are summarized by a few latent concept-words from the underlying graph that explain the observed words. Graph-Sparse LDA recovers sparse, interpretable summaries on two real-world biomedical datasets while matching state-of-the-art prediction performance.

[1]  Thomas L. Griffiths,et al.  The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies , 2007, JACM.

[2]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  David M. Blei,et al.  Probabilistic topic models , 2012, Commun. ACM.

[4]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[5]  William R. Hersh,et al.  Reducing workload in systematic review preparation using automated citation classification. , 2006, Journal of the American Medical Informatics Association : JAMIA.

[6]  Frank D. Wood,et al.  Hierarchically Supervised Latent Dirichlet Allocation , 2011, NIPS.

[7]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[8]  Eric P. Xing,et al.  Sparse Additive Generative Models of Text , 2011, ICML.

[9]  Timothy Baldwin,et al.  Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality , 2014, EACL.

[10]  Anima Anandkumar,et al.  Tensor decompositions for learning latent variable models , 2012, J. Mach. Learn. Res..

[11]  David B. Dunson,et al.  Topic Modeling with Nonparametric Markov Tree , 2011, ICML.

[12]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Xiaojin Zhu,et al.  A Topic Model for Word Sense Disambiguation , 2007, EMNLP.

[14]  Marc Light,et al.  Hiding a Semantic Hierarchy in a Markov Model , 1999, ACL 1999.

[15]  C E Lipscomb,et al.  Medical Subject Headings (MeSH). , 2000, Bulletin of the Medical Library Association.

[16]  Xiaohua Hu,et al.  Tree Labeled LDA: A Hierarchical model for web summaries , 2013, 2013 IEEE International Conference on Big Data.

[17]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[18]  Chong Wang,et al.  Reading Tea Leaves: How Humans Interpret Topic Models , 2009, NIPS.

[19]  Carla E. Brodley,et al.  Active learning for biomedical citation screening , 2010, KDD.

[20]  Guillaume Bouchard,et al.  Latent IBP Compound Dirichlet Allocation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Andrew McCallum,et al.  Optimizing Semantic Coherence in Topic Models , 2011, EMNLP.

[22]  Jordan L. Boyd-Graber,et al.  Efficient Tree-Based Topic Modeling , 2012, ACL.

[23]  Xiaojin Zhu,et al.  Incorporating domain knowledge into topic modeling via Dirichlet Forest priors , 2009, ICML '09.

[24]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[25]  Chong Wang,et al.  The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling , 2010, ICML.

[26]  Yichuan Zhang,et al.  Advances in Neural Information Processing Systems 25 , 2012 .

[27]  Finale Doshi-Velez,et al.  Comorbidity Clusters in Autism Spectrum Disorders: An Electronic Health Record Time-Series Analysis , 2014, Pediatrics.

[28]  Emily B. Fox,et al.  Concept Modeling with Superwords , 2012, ArXiv.

[29]  M. Ghassemi,et al.  Topic Models for Mortality Modeling in Intensive Care Units , 2012 .

[30]  F B ROGERS,et al.  Medical Subject Headings , 1948, Nature.

[31]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[32]  Timothy Baldwin,et al.  Automatic Evaluation of Topic Coherence , 2010, NAACL.

[33]  Chong Wang,et al.  Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process , 2009, NIPS.

[34]  J M Grimshaw,et al.  Effect of clinical guidelines on medical practice: a systematic review of rigorous evaluations. , 1994, Lancet.

[35]  Thomas L. Griffiths,et al.  The Indian Buffet Process: An Introduction and Review , 2011, J. Mach. Learn. Res..