Networks with trainable amplitude of activation functions

Algorithms for training neural networks have concentrated heavily on learning connection weights. Little effort has been made to learn the amplitude of activation functions, which defines the range of values the function can take. This paper introduces novel algorithms that learn the amplitudes of nonlinear activations in layered networks, without any assumption on their analytical form. Three instances of the algorithms are developed: (i) a common amplitude shared among all nonlinear units; (ii) a separate amplitude for each layer; and (iii) neuron-specific amplitudes. The algorithms can also be interpreted as a particular double-step gradient-descent procedure, as gradient-driven adaptive learning-rate schemes, or as weight-grouping techniques consistent with known scaling laws for regularization with weight decay. As a side effect, a self-pruning mechanism for redundant neurons may emerge. Experimental results on function approximation, classification, and regression tasks, with both synthetic and real-world data, validate the approach and show that the algorithms speed up convergence and modify the search path in weight space, possibly reaching deeper minima that may also improve generalization.
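For concreteness, the core idea, scaling each nonlinearity by a trainable amplitude that is updated by the same gradient descent as the connection weights, can be sketched as follows. This is a minimal illustration only, assuming a tanh nonlinearity and written in PyTorch; the class name AmplitudeTanh and the mode argument are illustrative conventions of this sketch, not notation from the paper.

```python
import torch
import torch.nn as nn

class AmplitudeTanh(nn.Module):
    """Tanh activation with trainable amplitude: f(x) = lambda * tanh(x).

    `mode` selects the granularity described in the abstract:
      - "shared" / "per_layer": a single scalar amplitude for this module
        (sharing one amplitude across the whole network can be arranged by
        reusing the same module instance in every layer)
      - "per_unit": one amplitude per neuron
    """
    def __init__(self, num_units: int, mode: str = "per_unit"):
        super().__init__()
        shape = (num_units,) if mode == "per_unit" else (1,)
        # Amplitudes start at 1, so training begins from a standard tanh.
        self.amplitude = nn.Parameter(torch.ones(shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcasting applies the per-unit (or scalar) amplitude elementwise.
        return self.amplitude * torch.tanh(x)

# Usage sketch: the amplitudes live in net.parameters(), so an ordinary
# optimizer updates them jointly with the weights.
net = nn.Sequential(nn.Linear(4, 8), AmplitudeTanh(8), nn.Linear(8, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
```

In the per-unit variant, an amplitude driven toward zero silences the corresponding neuron, which is one plausible reading of the self-pruning effect mentioned above.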
