Learning to Disentangle Factors of Variation with Manifold Interaction

Many latent factors of variation interact to generate sensory data; for example, pose, morphology and expression in face images. In this work, we propose to learn manifold coordinates for the relevant factors of variation and to model their joint interaction. Many existing feature learning algorithms focus on a single task and extract features that are sensitive to the task-relevant factors and invariant to all others. However, models that just extract a single set of invariant features do not exploit the relationships among the latent factors. To address this, we propose a higher-order Boltzmann machine that incorporates multiplicative interactions among groups of hidden units that each learn to encode a distinct factor of variation. Furthermore, we propose correspondence-based training strategies that allow effective disentangling. Our model achieves state-of-the-art emotion recognition and face verification performance on the Toronto Face Database. We also demonstrate disentangled features learned on the CMU Multi-PIE dataset.

[1]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[2]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[3]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[4]  Joshua B. Tenenbaum,et al.  Separating Style and Content with Bilinear Models , 2000, Neural Computation.

[5]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[6]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[7]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[8]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[9]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Massimiliano Pontil,et al.  Multi-Task Feature Learning , 2006, NIPS.

[11]  Honglak Lee,et al.  Sparse deep belief net model for visual area V2 , 2007, NIPS.

[12]  Marc'Aurelio Ranzato,et al.  Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[14]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[15]  Yoshua Bengio,et al.  Classification using discriminative restricted Boltzmann machines , 2008, ICML '08.

[16]  Geoffrey E. Hinton Reducing the Dimensionality of Data with Neural , 2008 .

[17]  Takeo Kanade,et al.  Multi-PIE , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[18]  R. Fergus,et al.  Learning invariant features through topographic filter maps , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[20]  Geoffrey E. Hinton,et al.  Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines , 2010, Neural Computation.

[21]  Veselin Stoyanov,et al.  Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure , 2011, AISTATS.

[22]  Pascal Vincent,et al.  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction , 2011, ICML.

[23]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[24]  Geoffrey E. Hinton,et al.  Modeling the joint density of two images under a variety of transformations , 2011, CVPR 2011.

[25]  Andrew Y. Ng,et al.  The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization , 2011, ICML.

[26]  Yoshua Bengio,et al.  A Spike and Slab Restricted Boltzmann Machine , 2011, AISTATS.

[27]  Honglak Lee,et al.  Unsupervised learning of hierarchical representations with convolutional deep belief networks , 2011, Commun. ACM.

[28]  Geoffrey E. Hinton,et al.  On deep generative models with applications to recognition , 2011, CVPR 2011.

[29]  Honglak Lee,et al.  Learning to Align from Scratch , 2012, NIPS.

[30]  Yoshua Bengio,et al.  Disentangling Factors of Variation via Generative Entangling , 2012, ArXiv.

[31]  Honglak Lee,et al.  Learning Invariant Representations with Local Transformations , 2012, ICML.

[32]  Honglak Lee,et al.  Learning hierarchical representations for face verification with convolutional deep belief networks , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Geoffrey E. Hinton A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.

[34]  Pascal Vincent,et al.  Disentangling Factors of Variation for Facial Expression Recognition , 2012, ECCV.

[35]  Geoffrey E. Hinton,et al.  Tensor Analyzers , 2013, ICML.

[36]  Yoshua Bengio,et al.  Multi-Prediction Deep Boltzmann Machines , 2013, NIPS.

[37]  Honglak Lee,et al.  Learning and Selecting Features Jointly with Point-wise Gated Boltzmann Machines , 2013, ICML.

[38]  Yoshua Bengio,et al.  Better Mixing via Deep Representations , 2012, ICML.