TRANSFERRING OPTIMALITY ACROSS DATA DISTRIBUTIONS VIA HOMOTOPY METHODS

Homotopy methods, also known as continuation methods, are a powerful mathematical tool for efficiently solving various problems in numerical analysis. In this work, we propose a novel homotopy-based numerical method that gradually transfers optimized parameters of a neural network across different data distributions. This method generalizes the widely used heuristic of pre-training parameters on one dataset and then fine-tuning them on another dataset of interest. We conduct a theoretical analysis showing that, under some assumptions, the homotopy method combined with Stochastic Gradient Descent (SGD) is guaranteed to converge in expectation to an r_θ-optimal solution for a target task when started from an r_θ-optimal solution (in expectation) on a source task. Empirical evaluations on a toy regression dataset, and on transferring optimized parameters from MNIST to Fashion-MNIST and to CIFAR-10, show substantial improvements in numerical performance over random initialization and pre-training.
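To make the idea concrete, below is a minimal sketch of a homotopy (continuation) transfer in a simple least-squares setting. The function names, the linear λ-schedule, and all hyperparameters are illustrative assumptions, not the paper's actual algorithm: a continuation parameter λ is swept from 0 (source task) to 1 (target task), and at each stage a few gradient steps are taken on the interpolated objective, warm-started from the previous stage, so that a solution optimized for the source distribution is gradually deformed into one for the target distribution.

```python
import numpy as np

# Hypothetical sketch of a homotopy (continuation) transfer:
# sweep lambda from 0 to 1 and optimize the interpolated objective
#   L_lambda(theta) = (1 - lambda) * L_source(theta) + lambda * L_target(theta),
# warm-starting each stage from the previous stage's solution.


def gd_on_interpolated_loss(theta, grad_source, grad_target, lam,
                            n_steps=50, lr=0.1):
    """A few gradient steps (full-batch here, for simplicity) on the lambda-interpolated loss."""
    for _ in range(n_steps):
        g = (1.0 - lam) * grad_source(theta) + lam * grad_target(theta)
        theta = theta - lr * g
    return theta


def homotopy_transfer(theta_source, grad_source, grad_target, n_stages=10):
    """Gradually deform a source-optimal theta into a solution for the target task."""
    theta = theta_source.copy()
    for lam in np.linspace(0.0, 1.0, n_stages + 1)[1:]:  # skip lambda = 0 (already solved)
        theta = gd_on_interpolated_loss(theta, grad_source, grad_target, lam)
    return theta


# Toy example: two least-squares regression tasks with different ground-truth weights.
rng = np.random.default_rng(0)
X_s, X_t = rng.normal(size=(200, 5)), rng.normal(size=(200, 5))
w_src, w_tgt = rng.normal(size=5), rng.normal(size=5)
y_s = X_s @ w_src + 0.1 * rng.normal(size=200)
y_t = X_t @ w_tgt + 0.1 * rng.normal(size=200)

grad_s = lambda w: X_s.T @ (X_s @ w - y_s) / len(y_s)   # gradient of the source squared-error loss
grad_t = lambda w: X_t.T @ (X_t @ w - y_t) / len(y_t)   # gradient of the target squared-error loss

theta0 = np.linalg.lstsq(X_s, y_s, rcond=None)[0]        # "pre-trained" source solution
theta_final = homotopy_transfer(theta0, grad_s, grad_t)  # continuation towards the target task
print(np.linalg.norm(theta_final - w_tgt))               # distance to the target ground-truth weights
```

The paper's setting combines the continuation with SGD on neural-network losses; the full-batch gradients and the uniform linear λ-schedule above are simplifications chosen only for readability.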
