A Theory of Feature Learning

Feature learning aims to extract the relevant information contained in data sets in an automated fashion. It is the driving force behind the current deep learning trend, a set of methods that have had widespread empirical success. What is lacking is a theoretical understanding of the different feature learning schemes. This work provides a theoretical framework for feature learning and then characterizes when features can be learnt in an unsupervised fashion. We also provide a means to judge the quality of features via rate-distortion theory and its generalizations.
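For context, a minimal sketch of the classical rate-distortion quantity that such a quality criterion builds on, in standard notation from Cover and Thomas (the paper's own generalization is not reproduced here): the rate-distortion function gives the fewest bits per symbol needed to describe a source X so that a reconstruction X̂ stays within expected distortion D under a distortion measure d,

    R(D) = \min_{p(\hat{x} \mid x) \,:\, \mathbb{E}[d(X,\hat{X})] \le D} \; I(X;\hat{X}).

Lower achievable rate at a given distortion indicates that the learned representation retains the information needed for accurate reconstruction.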
