Topic Modeling with Structured Priors for Text-Driven Science

Many scientific disciplines are being revolutionized by the explosion of public data on the web and social media, particularly in health and social sciences. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires tools that can make sense of massive amounts of unstructured and unlabeled text. Topic models, statistical models that posit low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. For example, prominent themes of discussion in social media might include politics and health. To be useful in scientific analyses, topic models must learn interpretable patterns that accurately correspond to real-world concepts of interest. This thesis will introduce topic models that can encode additional structures such as factorizations, hierarchies, and correlations of topics, and can incorporate supervision and domain knowledge. For example, topics about elections and Congressional legislation are related to each other (as part of a
