A Tutorial on Energy-Based Learning

Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.
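
To make the inference/learning distinction concrete, here is a minimal sketch (not from the tutorial itself): a toy linear energy E(x, y) = -w_y · x over a small discrete label set, with inference as an argmin over labels and a perceptron-style update that pushes down the energy of the observed configuration and pulls up the energy of the model's current best answer. The linear energy form, the perceptron-type loss, the learning rate, and the toy data are all illustrative assumptions.

```python
import numpy as np

# Toy energy-based classifier. Assumed setup: a linear energy
# E(x, y) = -W[y] . x over n_labels discrete labels; lower energy
# means x and y are more compatible.

rng = np.random.default_rng(0)
n_features, n_labels = 2, 3
W = rng.normal(size=(n_labels, n_features))  # one weight vector per label

def energy(x, y):
    """Scalar energy of the configuration (x, y)."""
    return -W[y] @ x

def infer(x):
    """Inference: clamp the observed x, minimize energy over labels y."""
    return min(range(n_labels), key=lambda y: energy(x, y))

def train_step(x, y_true, lr=0.1):
    """Perceptron-type update: lower E(x, y_true), raise E(x, y_hat)."""
    y_hat = infer(x)
    if y_hat != y_true:
        W[y_true] += lr * x   # pushes down the energy of the observed answer
        W[y_hat] -= lr * x    # pulls up the energy of the offending answer

# Toy data: three Gaussian blobs, one per label (purely illustrative).
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])
Y = np.repeat(np.arange(n_labels), 50)

for _ in range(20):                      # a few passes over the data
    for x, y in zip(X, Y):
        train_step(x, y)

accuracy = np.mean([infer(x) == y for x, y in zip(X, Y)])
print(f"training accuracy: {accuracy:.2f}")
```

Note that the update touches only the energies of the observed answer and the model's current answer; no normalization over all configurations is ever computed, which is the flexibility the abstract contrasts with properly normalized probabilistic models.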
