A majorization-minimization algorithm for (multiple) hyperparameter learning

We present a general Bayesian framework for hyperparameter tuning in L2-regularized supervised learning models. Paradoxically, our algorithm works by first analytically integrating out the hyperparameters from the model. We find a local optimum of the resulting non-convex optimization problem efficiently using a majorization-minimization (MM) algorithm, in which the non-convex problem is reduced to a series of convex L2-regularized parameter estimation tasks. The principal appeal of our method is its simplicity: the updates that define the L2-regularized subproblem at each step are trivial to implement (or even perform by hand), and each subproblem can be solved efficiently by adapting existing solvers. Empirical results on a variety of supervised learning models show that our algorithm is competitive with both grid-search and gradient-based algorithms, but is more efficient and far easier to implement.
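To make the shape of the MM loop concrete, the following is a minimal sketch for the special case of squared loss with one hyperparameter per weight. The specific objective, constants, and update rule below are illustrative assumptions, not the paper's exact derivation; only the overall structure, alternating between a closed-form ridge subproblem and a trivial hyperparameter reweighting, mirrors the description above.

```python
import numpy as np

def mm_hyperparameter_learning(X, y, n_iters=20, eps=1e-6):
    """Illustrative MM loop for learning per-feature L2 regularization weights.

    Assumed simplified objective: 0.5*||Xw - y||^2 + 0.5*sum_j log(w_j^2 + eps),
    i.e. the log terms that appear after the hyperparameters are integrated out.
    Each MM step majorizes the log penalty by its tangent at the current iterate,
    so every subproblem is an ordinary weighted ridge regression.
    """
    n, d = X.shape
    lam = np.ones(d)   # current effective per-feature regularization weights
    w = np.zeros(d)
    for _ in range(n_iters):
        # Convex subproblem: weighted ridge regression, solved in closed form.
        A = X.T @ X + np.diag(lam)
        w = np.linalg.solve(A, X.T @ y)
        # Hyperparameter update: tangent majorization of log(w_j^2 + eps)
        # yields this simple reweighting (constants here are illustrative).
        lam = 1.0 / (w ** 2 + eps)
    return w, lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    w_true = np.zeros(10)
    w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.1 * rng.standard_normal(100)
    w_hat, lam_hat = mm_hyperparameter_learning(X, y)
    print(np.round(w_hat, 2))
```

As the sketch suggests, the per-iteration work is dominated by a standard L2-regularized fit, which is why existing solvers can be reused with little modification.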
