Combination of supervised and unsupervised learning for training the activation functions of neural networks

Standard feedforward neural networks benefit from the sound theoretical properties of mixtures of sigmoid activation functions, yet they may fail in several practical learning tasks. Such tasks are better addressed by relying on a more appropriate, problem-specific basis of activation functions. This paper presents a connectionist model that exploits adaptive activation functions. Each hidden unit in the network is associated with a specific pair (f(·), p(·)), where f(·) is the activation function and p(·) is the likelihood that the unit is relevant to the computation of the network output on the current input. The function f(·) is optimized in a supervised manner, while p(·) is realized via a statistical parametric model learned through unsupervised (or partially supervised) estimation. Since f(·) and p(·) influence each other's learning process, the overall machine is implicitly a co-trained, coupled model and, in turn, a flexible, non-standard neural architecture. The feasibility of the approach is corroborated by empirical evidence from computer simulations involving regression and classification tasks.
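
As an illustration of the coupled (f(·), p(·)) scheme described above, the following sketch implements a one-hidden-layer regression network in Python/NumPy. It is not the paper's exact formulation: the adaptive activation f_j is assumed here to be a tanh with trainable amplitude and slope, the relevance model p_j is assumed to be a diagonal Gaussian over the input fitted by an online maximum-likelihood update, and all names and hyperparameters (CoTrainedNet, lam, beta, mu, var, the learning rates) are illustrative assumptions rather than the authors' definitions.

```python
# Minimal sketch (not the authors' exact formulation): a one-hidden-layer network
# in which every hidden unit j carries an adaptive activation f_j and a Gaussian
# density p_j over the input. f_j is assumed to be an amplitude/slope-adaptive
# tanh; p_j is fit by unsupervised maximum likelihood. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

class CoTrainedNet:
    def __init__(self, n_in, n_hidden):
        self.V = rng.normal(scale=0.5, size=(n_hidden, n_in))  # input-to-hidden weights
        self.b = np.zeros(n_hidden)                            # hidden biases
        self.w = rng.normal(scale=0.5, size=n_hidden)          # hidden-to-output weights
        self.lam = np.ones(n_hidden)                           # trainable amplitudes of f_j
        self.beta = np.ones(n_hidden)                          # trainable slopes of f_j
        self.mu = rng.normal(size=(n_hidden, n_in))            # Gaussian means of p_j
        self.var = np.ones((n_hidden, n_in))                   # diagonal variances of p_j

    def _p(self, x):
        # p_j(x): diagonal-Gaussian likelihood that unit j is relevant to input x
        d = (x - self.mu) ** 2 / self.var
        logp = -0.5 * (d.sum(axis=1) + np.log(2 * np.pi * self.var).sum(axis=1))
        return np.exp(logp)

    def forward(self, x):
        a = self.V @ x + self.b                    # net input of each hidden unit
        f = self.lam * np.tanh(self.beta * a)      # adaptive activation f_j(a_j)
        p = self._p(x)                             # relevance likelihood p_j(x)
        h = p * f                                  # unit output weighted by its likelihood
        return self.w @ h, (a, f, p, h)

    def supervised_step(self, x, t, lr=1e-2):
        # Gradient descent on squared error w.r.t. the connectionist parameters
        # (w, V, b) and the adaptive-activation parameters (lam, beta).
        y, (a, f, p, h) = self.forward(x)
        err = y - t
        tanh_a = np.tanh(self.beta * a)
        sech2 = 1.0 - tanh_a ** 2
        df = err * self.w * p                      # dE/df_j
        da = df * self.lam * sech2 * self.beta     # dE/da_j
        grad_w, grad_lam = err * h, df * tanh_a
        grad_beta = df * self.lam * sech2 * a
        self.w -= lr * grad_w
        self.lam -= lr * grad_lam
        self.beta -= lr * grad_beta
        self.V -= lr * np.outer(da, x)
        self.b -= lr * da
        return 0.5 * err ** 2

    def unsupervised_step(self, x, lr=1e-2):
        # Online, responsibility-weighted maximum-likelihood update of each unit's
        # Gaussian: a crude stand-in for the paper's statistical parametric model.
        p = self._p(x)
        r = p / (p.sum() + 1e-12)                  # soft "responsibility" of unit j for x
        self.mu += lr * r[:, None] * (x - self.mu) / self.var
        self.var += lr * 0.5 * r[:, None] * (((x - self.mu) ** 2) / self.var - 1.0) / self.var
        self.var = np.clip(self.var, 1e-3, None)

if __name__ == "__main__":
    # Toy regression: co-train on t = sin(x0) + 0.5 * x1
    net = CoTrainedNet(n_in=2, n_hidden=8)
    for _ in range(2000):
        x = rng.uniform(-2, 2, size=2)
        t = np.sin(x[0]) + 0.5 * x[1]
        net.supervised_step(x, t)
        net.unsupervised_step(x)
    test = np.array([1.0, -0.5])
    print("prediction:", net.forward(test)[0], "target:", np.sin(1.0) - 0.25)
```

The two steps are interleaved on each training pattern, so the activation parameters are driven by the supervised error signal while the Gaussians track the input distribution; since each unit's output is weighted by its own p_j(x), the two estimates influence one another, mimicking the co-training effect described in the abstract.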
