Score Function Features for Discriminative Learning: Matrix and Tensor Framework

Author(s): Janzamin, Majid; Sedghi, Hanie; Anandkumar, Anima | Abstract: Feature learning forms the cornerstone for tackling challenging learning problems in domains such as speech, computer vision, and natural language processing. In this paper, we consider a novel class of matrix- and tensor-valued features, which can be pre-trained using unlabeled samples. We present efficient algorithms for extracting discriminative information, given these pre-trained features and labeled samples for any related task. Our class of features is based on higher-order score functions, which capture local variations in the probability density function of the input. We establish a theoretical framework to characterize the nature of discriminative information that can be extracted from score-function features, when used in conjunction with labeled samples. We employ efficient spectral decomposition algorithms (on matrices and tensors) for extracting discriminative components. The advantage of employing tensor-valued features is that we can extract richer discriminative information in the form of overcomplete representations. Thus, we present a novel framework for employing generative models of the input for discriminative learning.
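
Below is a minimal numerical sketch of the matrix case of this framework, assuming the input is Gaussian so that the second-order score function S_2(x) = (grad^2 p)(x) / p(x) = Sigma^{-1} x x^T Sigma^{-1} - Sigma^{-1} is available in closed form. The hidden directions U and the label function G in the code are synthetic, illustrative assumptions, not taken from the paper; the point they illustrate is the paper's Stein-type identity E[y * S_2(x)] = E[grad^2 G(x)], whose eigendecomposition exposes the discriminative subspace.

```python
import numpy as np

# Minimal sketch: matrix-valued score-function features for a Gaussian input,
# combined with labels via a Stein-type identity. The generative model
# (zero-mean Gaussian with known covariance) and the label function G are
# illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
n, d = 100_000, 8

# Unlabeled phase: (pre-)train a generative model of the input. Here we
# simply assume x ~ N(0, Sigma) with Sigma known.
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)
x = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Matrix-valued score feature:
#   S_2(x) = Sigma^{-1} x x^T Sigma^{-1} - Sigma^{-1}.
xw = x @ Sigma_inv                                  # each row is Sigma^{-1} x
S2 = xw[:, :, None] * xw[:, None, :] - Sigma_inv    # shape (n, d, d)

# Labeled phase: labels depend on x only through two hidden directions
# (synthetic ground truth for this demo).
U = np.linalg.qr(rng.standard_normal((d, 2)))[0]    # orthonormal pair u1, u2
y = np.tanh(x @ U[:, 0] + 1.0) + 0.5 * (x @ U[:, 1]) ** 2  # E[y | x] = G(x)

# Cross-moment E[y * S_2(x)] = E[grad^2 G(x)] (Stein-type identity); its
# dominant eigenvectors span the discriminative subspace span{u1, u2}.
M = np.einsum('i,ijk->jk', y, S2) / n
eigvals, eigvecs = np.linalg.eigh(M)
est = eigvecs[:, np.argsort(-np.abs(eigvals))[:2]]

# Cosines of the principal angles between the estimate and U; near 1 on success.
print(np.linalg.svd(est.T @ U, compute_uv=False))
```

In the tensor case, the same identity with the third-order score function gives E[y * S_3(x)] = E[grad^3 G(x)], and decomposing this tensor (rather than a matrix) is what allows the framework to extract more discriminative components than the input dimension, i.e., overcomplete representations.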
