Learning Deeper Non-Monotonic Networks by Softly Transferring Solution Space

Unlike popular neural networks built on quasiconvex activations, non-monotonic networks activated by periodic nonlinearities have emerged as a competitive paradigm, offering two notable benefits: 1) compactly characterizing high-frequency patterns; 2) precisely representing high-order derivatives. Nevertheless, they are also well known for being hard to train: they easily overfit dissonant noise and only admit tiny architectures (shallower than 5 layers). The fundamental bottleneck is that periodicity produces many poor and densely distributed local minima in the solution space. The direction and norm of the gradient oscillate continually during error backpropagation, so non-monotonic networks become prematurely stuck in these local minima and fail to receive effective error feedback. To alleviate this optimization dilemma, in this paper we propose a non-trivial soft transfer approach. It first smooths the solution space of non-monotonic networks toward that of monotonic ones, and then improves their representational properties by gradually transferring solutions from the neural space of monotonic neurons to the Fourier space of non-monotonic neurons as training continues. The soft transfer consists of two core components: 1) a rectified concrete gate that characterizes the state of each neuron; 2) a variational Bayesian learning framework that dynamically balances the empirical risk against the intensity of transfer. We provide comprehensive empirical evidence that the soft transfer not only reduces the risk of non-monotonic networks overfitting noise, but also helps them scale to much deeper architectures (more than 100 layers), achieving new state-of-the-art performance.
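To make the mechanism concrete, the sketch below shows one way such a soft-transfer layer could be realized, assuming the rectified concrete gate follows the hard-concrete construction of Louizos et al. and that each neuron blends a monotonic ReLU response with a periodic sine response. The layer name `SoftTransferLayer`, the choice of activations, and the hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# A minimal, hypothetical sketch of a soft-transfer layer. Assumptions: the
# rectified concrete gate follows the hard-concrete construction of Louizos
# et al. (L0 regularization), and each neuron interpolates between a monotonic
# ReLU response and a periodic sine response.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftTransferLayer(nn.Module):
    """Linear layer whose neurons blend ReLU (monotonic) and sine (periodic)
    activations through per-neuron rectified concrete gates."""

    def __init__(self, in_features, out_features,
                 beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # log_alpha parameterizes each gate: very negative values keep the
        # neuron monotonic, large positive values make it fully periodic.
        self.log_alpha = nn.Parameter(torch.full((out_features,), -2.0))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def gate(self):
        if self.training:
            # Reparameterized sample from the binary concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid(
                (torch.log(u) - torch.log1p(-u) + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta), then rectify into [0, 1] ("hard concrete").
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def forward(self, x):
        z = self.linear(x)
        g = self.gate()
        # g = 0 recovers a purely monotonic neuron, g = 1 a purely periodic one.
        return (1.0 - g) * F.relu(z) + g * torch.sin(z)

    def transfer_penalty(self):
        # Expected number of neurons whose gate is non-zero; weighting this
        # term in the loss is one way to modulate the intensity of transfer.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)).sum()
```

In this sketch, initializing `log_alpha` to negative values keeps every gate near zero, so early training behaves like a monotonic network; a weight on `transfer_penalty()` in the overall objective would then play the role the abstract assigns to the variational Bayesian framework, trading empirical risk off against the intensity of transfer.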
