On Parameter Learning for Perturb-and-MAP Models

Probabilistic graphical models encode hidden dependencies between random variables for data modeling, and parameter estimation is a crucial and necessary part of working with them. These very general models have been used in many fields, such as computer vision, signal processing, natural language processing, and others. We focus mainly on log-supermodular models, a specific subclass of exponential family distributions in which the potential function is assumed to be the negative of a submodular function. This property turns out to be very convenient for maximum a posteriori (MAP) estimation and parameter learning. Despite this apparent restriction, the models of interest cover a large part of the exponential family, since many functions are submodular, for example graph cuts and entropy. Probabilistic inference is well known to be challenging for most such models, but we are able to address some of these challenges, at least approximately. In this manuscript, we exploit perturb-and-MAP ideas to approximate the partition function and hence to learn parameters efficiently. Moreover, the problem can also be interpreted as a structure learning task, where each estimated parameter or weight represents the importance of the corresponding term. We propose approximate parameter estimation and inference methods for models in which exact learning and inference are intractable in general, owing to the computational complexity of the partition function.

The first part of the thesis is devoted to theoretical guarantees. Given log-supermodular models, we take advantage of the efficient minimization enabled by submodularity. By introducing and comparing two existing upper bounds on the partition function, we are able to establish their relationship by proving a theoretical result. We also introduce an approach to missing data as a natural subroutine of probabilistic modeling. It turns out that we can apply a stochastic technique to the proposed perturb-and-MAP approximation while preserving convergence and making it faster in practice.

The second main contribution of this thesis is an efficient and scalable generalization of the parameter learning approach. Here we develop new algorithms for parameter estimation under various loss functions and different levels of supervision, and we also address scalability. In particular, working mainly with graph cuts, we are able to integrate several acceleration techniques.
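To make the perturb-and-MAP approximation of the partition function concrete, the following sketch (an illustrative assumption, not the thesis implementation) estimates the standard logistic-perturbation upper bound on log Z for a toy log-supermodular model. The chain-graph cut function, the sample size, and the brute-force MAP oracle are all illustrative choices; for real problems the inner maximization is a submodular minimization and can be solved efficiently, e.g., by a min-cut.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy ground set and a small submodular function: the cut function of an
# undirected chain graph. F(A) counts edges crossing the boundary of A.
n = 8
edges = [(i, i + 1) for i in range(n - 1)]

def F(A):
    """Cut function of the chain: edges with exactly one endpoint in A."""
    return sum((i in A) != (j in A) for i, j in edges)

# Exact log-partition function by enumeration (feasible only for tiny n):
# log Z = log sum_{A subseteq V} exp(-F(A)).
all_sets = [frozenset(s) for r in range(n + 1)
            for s in itertools.combinations(range(n), r)]
logZ = np.logaddexp.reduce([-F(A) for A in all_sets])

def map_oracle(z):
    """MAP under a modular perturbation: argmax_A z(A) - F(A).
    Brute force here; since F is submodular this is a submodular
    minimization and scales to large ground sets (e.g., via min-cut)."""
    return max(all_sets, key=lambda A: sum(z[i] for i in A) - F(A))

# Perturb-and-MAP upper bound on log Z: E_z[ max_A z(A) - F(A) ] with
# i.i.d. zero-mean logistic perturbations z_i (differences of two Gumbel
# variables), estimated by Monte Carlo.
num_samples = 500
vals = []
for _ in range(num_samples):
    z = rng.logistic(size=n)
    A = map_oracle(z)
    vals.append(sum(z[i] for i in A) - F(A))

print(f"exact log Z          : {logZ:.3f}")
print(f"perturb-and-MAP bound: {np.mean(vals):.3f} "
      f"(+/- {np.std(vals) / np.sqrt(num_samples):.3f})")
```

Each Monte Carlo sample requires only one MAP call, which is what makes this bound, unlike exact enumeration, usable inside a parameter learning loop.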
Finally, we address a general problem of learning on continuous signals. In this part, we focus on sparse graphical model representations, using common sparsity-inducing regularizers as prior-based potentials. The proposed denoising techniques do not require choosing a specific regularizer in advance. To perform sparse representation learning, the community often uses symmetric losses such as the l1 norm; instead, we propose to parameterize the loss and to learn the weight of each loss component from the data (a minimal sketch is given below), which is made possible by the approach proposed in the previous sections.

For all the aspects of parameter estimation mentioned above, we carried out numerical experiments to validate each idea or compare it against existing baselines, and to demonstrate its performance in practice.
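The sketch below illustrates what such a parameterized loss can look like: an asymmetric l1 penalty whose per-component positive and negative weights are hand-picked here purely for illustration, whereas in the thesis they would be learned from data. Choosing equal weights recovers the usual symmetric l1 soft-thresholding.

```python
import numpy as np

def asymmetric_soft_threshold(y, a, b):
    """Proximal operator of the asymmetric l1 penalty
    sum_i a[i] * max(x_i, 0) + b[i] * max(-x_i, 0):
    positive and negative parts are shrunk by different amounts."""
    return np.where(y > a, y - a, np.where(y < -b, y + b, 0.0))

rng = np.random.default_rng(0)

# Sparse ground-truth signal observed under Gaussian noise.
n = 50
x_true = np.zeros(n)
x_true[[5, 17, 33]] = [3.0, -2.0, 4.0]
y = x_true + 0.5 * rng.standard_normal(n)

# With a == b this reduces to plain (symmetric) l1 soft-thresholding;
# distinct weights give the parameterized, asymmetric loss. The weights
# are hand-picked here; in the thesis they would be learned from data.
symmetric = asymmetric_soft_threshold(y, np.full(n, 1.0), np.full(n, 1.0))
asymmetric = asymmetric_soft_threshold(y, np.full(n, 0.5), np.full(n, 1.5))

print("nonzeros (symmetric): ", np.count_nonzero(symmetric))
print("nonzeros (asymmetric):", np.count_nonzero(asymmetric))
```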
