A Theory of Feature Learning

Feature learning aims to extract the relevant information contained in data sets in an automated fashion. It is the driving force behind the current deep learning trend, a set of methods that have had widespread empirical success. What is lacking is a theoretical understanding of the different feature learning schemes. This work provides a theoretical framework for feature learning and then characterizes when features can be learnt in an unsupervised fashion. We also provide a means to judge the quality of features via rate-distortion theory and its generalizations.
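For context, a minimal sketch of the classical rate-distortion quantity that such a quality criterion builds on, in standard notation from Cover and Thomas (the paper's own generalization is not reproduced here): the rate-distortion function gives the fewest bits per symbol needed to describe a source X so that a reconstruction X̂ stays within expected distortion D under a distortion measure d,

    R(D) = \min_{p(\hat{x} \mid x) \,:\, \mathbb{E}[d(X,\hat{X})] \le D} \; I(X;\hat{X}).

Lower achievable rate at a given distortion indicates that the learned representation retains the information needed for accurate reconstruction.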
