Modularizing Deep Learning via Pairwise Learning With Kernels

By redefining the conventional notion of layers, we present an alternative view of finitely wide, fully trainable deep neural networks as stacked linear models in feature spaces, leading to a kernel machine interpretation. Based on this construction, we then propose a provably optimal modular learning framework for classification that does not require between-module backpropagation. This modular approach brings new insights into the label requirements of deep learning (DL): it leverages only implicit pairwise labels (weak supervision) when learning the hidden modules. Training the output module, in contrast, requires full supervision but is highly label-efficient, needing as few as ten randomly selected labeled examples (one per class) to achieve 94.88% accuracy on CIFAR-10 with a ResNet-18 backbone. Moreover, modular training enables fully modularized DL workflows, which simplify the design and implementation of pipelines and improve the maintainability and reusability of models. To showcase the advantages of such a modularized workflow, we describe a simple yet reliable method for estimating the reusability of pretrained modules as well as task transferability in a transfer learning setting. At practically no computational overhead, this method precisely describes the task-space structure of 15 binary classification tasks derived from CIFAR-10.
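To make the two-stage modular workflow above concrete, the sketch below shows one possible realization: a hidden module (a ResNet-18 feature map) trained in isolation from only pairwise "same class / different class" labels, followed by an output module fitted on a handful of fully labeled examples with the hidden module frozen. This is not the paper's implementation; the cosine-similarity pairwise objective, the module boundaries, and the nearest-centroid output module (`HiddenModule`, `pairwise_loss`, `train_hidden_module`, `fit_output_module`) are illustrative choices made for this sketch.

```python
# A minimal sketch (not the authors' implementation) of a two-stage modular
# workflow: Stage 1 trains the hidden module with only pairwise labels,
# Stage 2 fits the output module on frozen features with few labeled examples.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class HiddenModule(nn.Module):
    """ResNet-18 backbone used as a trainable feature map (hypothetical setup)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = resnet18(num_classes=feat_dim)

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # unit-norm features


def pairwise_loss(z1, z2, same_class):
    """Weakly supervised surrogate objective: it only needs to know whether
    each pair shares a class (1) or not (0), never the class identities."""
    sim = (z1 * z2).sum(dim=1)                 # cosine similarity per pair
    target = 2.0 * same_class.float() - 1.0    # map {0, 1} -> {-1, +1}
    return ((sim - target) ** 2).mean()


def train_hidden_module(module, pair_loader, epochs=10, lr=1e-3):
    """Stage 1: the hidden module is optimized in isolation against pairwise
    labels, so no gradients ever need to flow in from a later module."""
    opt = torch.optim.Adam(module.parameters(), lr=lr)
    module.train()
    for _ in range(epochs):
        for x1, x2, same_class in pair_loader:   # loader yields image pairs
            loss = pairwise_loss(module(x1), module(x2), same_class)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return module


def fit_output_module(module, x_labeled, y_labeled, num_classes=10):
    """Stage 2: with the hidden module frozen, a tiny labeled set (e.g. one
    example per class) fixes the output layer, here via class centroids."""
    module.eval()
    with torch.no_grad():
        feats = module(x_labeled)                # (n, d) frozen features
    centroids = torch.stack(
        [feats[y_labeled == c].mean(dim=0) for c in range(num_classes)]
    )

    def predict(x):
        with torch.no_grad():
            z = module(x)
        return (z @ centroids.t()).argmax(dim=1)  # nearest-centroid classifier

    return predict
```

A usage note on the design: because the pairwise stage never sees class identities, the pair loader can be built from any source of "same/different" supervision, and the output module can be refit for a new label set without retraining the backbone, which is the reusability property the modular workflow is meant to expose.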
