Discriminative machine learning with structure

Some of the best-performing classifiers in modern machine learning have been designed using discriminative learning, as exemplified by Support Vector Machines. The ability of discriminative learning to use flexible features via the kernel trick has enlarged the set of possible applications for machine learning. With this expanded range of applications, however, it has become apparent that real-world data exhibits more structure than classical methods assume. In this thesis, we show how to extend the discriminative learning framework to exploit different types of structure: on the one hand, structure on outputs, such as the combinatorial structure in word alignment; on the other hand, latent variable structure on inputs, such as in text document classification.

In the context of structured output classification, we present a scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem, which allows us to use simple projection methods based on the dual extragradient algorithm of Nesterov. We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment. We then show how to obtain state-of-the-art results on the word alignment task by formulating it as a quadratic assignment problem within our discriminative learning framework.

In the context of latent variable models, we present DiscLDA, a discriminative variant of the Latent Dirichlet Allocation (LDA) model, which has been widely used to model collections of text documents or images. In DiscLDA, we introduce a class-dependent linear transformation on the topic mixture proportions of LDA and estimate it discriminatively by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the classification task. Our experiments on the 20 Newsgroups document classification task show that our model can identify topics shared across classes as well as discriminative class-dependent topics.
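
To make the saddle-point idea mentioned above more concrete, here is a minimal sketch of a projected extragradient iteration on a toy problem. It is not the thesis's method: it uses the basic extragradient scheme of Korpelevich rather than Nesterov's dual extragradient, and the objective L(w, z) = (w - 1)(z - 0.5), the box constraints, the step size, and all function names are assumptions chosen only for illustration.

```python
def clip(value, low, high):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def extragradient_saddle(w, z, step=0.5, iters=1000):
    """Toy projected extragradient for min_w max_z (w - 1) * (z - 0.5).

    Illustrative assumptions: w is constrained to [-2, 2], z to [0, 1],
    and the unique saddle point is (w, z) = (1, 0.5).
    """
    w_box = (-2.0, 2.0)
    z_box = (0.0, 1.0)
    for _ in range(iters):
        # Gradients at the current point: dL/dw = z - 0.5, dL/dz = w - 1.
        grad_w, grad_z = z - 0.5, w - 1.0
        # Prediction step: descend in w, ascend in z, then project.
        w_half = clip(w - step * grad_w, *w_box)
        z_half = clip(z + step * grad_z, *z_box)
        # Correction step: re-evaluate gradients at the predicted point,
        # but apply them starting from the original point.
        grad_w, grad_z = z_half - 0.5, w_half - 1.0
        w = clip(w - step * grad_w, *w_box)
        z = clip(z + step * grad_z, *z_box)
    return w, z


if __name__ == "__main__":
    w_star, z_star = extragradient_saddle(w=-1.5, z=0.9)
    print(w_star, z_star)
```

The correction step is what separates this from plain simultaneous gradient descent-ascent, which on bilinear objectives of this kind tends to cycle around the saddle point rather than approach it. In a structured setting the scalar clipping would be replaced by projections onto the relevant feasible sets.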
