Dataset Meta-Learning from Kernel Ridge-Regression

One of the most fundamental aspects of any machine learning algorithm is the training data it uses. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets that are either much smaller than, or significant corruptions of, the original training data while maintaining similar model performance. We then introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving upon previous dataset distillation and subset-selection methods while obtaining state-of-the-art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state-of-the-art results for neural network dataset distillation, with potential applications to privacy preservation.
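
To make the core idea concrete, here is a minimal sketch of a KIP-style objective: a small "support" set of images and labels is treated as learnable parameters and optimized so that kernel ridge-regression fit on it predicts the full training set well. This is not the authors' implementation; an RBF kernel stands in for the infinite-width neural tangent kernel used in the paper, and `X_train`, `y_train`, the support-set size, and all hyperparameters below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def rbf_kernel(A, B, gamma=0.1):
    # k(a, b) = exp(-gamma * ||a - b||^2); a stand-in for the NTK of the paper.
    sq = jnp.sum(A**2, 1)[:, None] + jnp.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return jnp.exp(-gamma * sq)

def kip_loss(params, X_t, y_t, reg=1e-6):
    # params = (X_s, y_s): the learnable support images and (soft) labels.
    X_s, y_s = params
    K_ss = rbf_kernel(X_s, X_s)
    K_ts = rbf_kernel(X_t, X_s)
    # KRR prediction on the target (training) set using only the support set.
    preds = K_ts @ jnp.linalg.solve(K_ss + reg * jnp.eye(K_ss.shape[0]), y_s)
    return jnp.mean((preds - y_t) ** 2)

@jax.jit
def kip_step(params, X_t, y_t, lr=0.01):
    # One gradient step on the support set itself (a meta-learning update).
    loss, grads = jax.value_and_grad(kip_loss)(params, X_t, y_t)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

# Hypothetical usage: initialize the support set from a few training examples,
# then iterate gradient steps on the support images and labels.
# params = (X_train[:100], y_train[:100])
# for _ in range(1000):
#     params, loss = kip_step(params, X_train, y_train)
```

In the paper the kernel is an exact infinite-width NTK and the target loss is evaluated on sampled batches of training data, but the structure of the update is the same: differentiate the KRR fit through the support set and descend on the support set, not on model weights.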
