Prediction-Constrained Training for Semi-Supervised Mixture and Topic Models

Supervisory signals have the potential to make low-dimensional data representations, like those learned by mixture and topic models, more interpretable and useful. We propose a framework for training latent variable models that explicitly balances two goals: recovery of faithful generative explanations of high-dimensional data, and accurate prediction of associated semantic labels. Existing approaches fail to achieve these goals due to an incomplete treatment of a fundamental asymmetry: the intended application is always predicting labels from data, not data from labels. Our prediction-constrained objective for training generative models coherently integrates loss-based supervisory signals while enabling effective semi-supervised learning from partially labeled data. We derive learning algorithms for semi-supervised mixture and topic models using stochastic gradient descent with automatic differentiation. We demonstrate improved prediction quality compared to several previous supervised topic models, achieving predictions competitive with high-dimensional logistic regression on text sentiment analysis and electronic health records tasks while simultaneously learning interpretable topics.
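The core idea above, a generative likelihood term on all data plus a weighted label-prediction loss on the labeled subset, jointly optimized by gradient descent, can be illustrated with a minimal sketch. The two-component 1-D Gaussian mixture, the logistic read-out on posterior responsibilities, the trade-off weight `lam`, and all variable names here are illustrative assumptions for a toy semi-supervised setting, not the paper's actual model or algorithm (which uses automatic differentiation rather than the numerical gradients used below for self-containment):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two 1-D Gaussian clusters; labels observed for only 10% of points.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
labeled = rng.choice(200, size=20, replace=False)

def objective(params, lam=5.0):
    """Prediction-constrained objective: generative NLL + lam * label loss."""
    mu0, mu1, w = params
    # Generative term: mixture negative log-likelihood on ALL points
    # (equal weights, unit variances, constants dropped).
    logp0 = -0.5 * (x - mu0) ** 2
    logp1 = -0.5 * (x - mu1) ** 2
    m = np.maximum(logp0, logp1)
    nll = -np.mean(m + np.log(0.5 * np.exp(logp0 - m) + 0.5 * np.exp(logp1 - m)))
    # Discriminative term: predict labels from posterior responsibilities,
    # matching the data -> label direction of the intended application.
    r1 = 1.0 / (1.0 + np.exp(logp0 - logp1))        # P(cluster 1 | x)
    p = 1.0 / (1.0 + np.exp(-w * (r1[labeled] - 0.5)))
    pred_loss = -np.mean(y[labeled] * np.log(p + 1e-9)
                         + (1 - y[labeled]) * np.log(1 - p + 1e-9))
    return nll + lam * pred_loss

def num_grad(f, params, eps=1e-5):
    """Central-difference gradient, standing in for automatic differentiation."""
    g = np.zeros_like(params)
    for i in range(len(params)):
        d = np.zeros_like(params)
        d[i] = eps
        g[i] = (f(params + d) - f(params - d)) / (2 * eps)
    return g

params = np.array([-0.5, 0.5, 1.0])  # deliberately poor initialization
for _ in range(1000):
    params = params - 0.1 * num_grad(objective, params)

mu0, mu1, w = params  # means recover the clusters; w scales the label read-out
```

The `lam` weight encodes the asymmetry discussed above: it upweights the labels-from-data prediction task relative to pure generative fit, so the learned clusters remain predictive of labels even when only a small fraction of points are labeled.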
