Variational Optimization

We discuss a general technique that can be used to form a differentiable bound on the optima of non-differentiable or discrete objective functions. We form a unified description of these methods and consider under which circumstances the bound is concave. In particular we consider two concrete applications of the method, namely sparse learning and support vector classification.

1 Optimization by Variational Bounding

We consider the general problem of function maximization, max_x f(x), for a vector x. When f is differentiable and x is continuous, optimization methods that use gradient information are typically preferred over gradient-free approaches, since they can exploit a locally optimal direction in which to search. However, when f is not differentiable or x is discrete, gradient-based approaches are not directly applicable. In that case, alternatives such as relaxation, coordinate-wise optimization and stochastic approaches are popular [1]. Our interest is to discuss another general class of methods that yield differentiable surrogate objectives for discrete x or non-differentiable f. The Variational Optimization (VO) approach is based on the simple bound

    f^* = \max_{x \in C} f(x) \;\geq\; \langle f(x) \rangle_{p(x|\theta)} \;\equiv\; E(\theta)    (1)

where ⟨·⟩_p denotes expectation with respect to the distribution p defined over the solution space C. The parameters θ of the distribution p(x|θ) can then be adjusted to maximize the lower bound E(θ). The bound can be made trivially tight provided the distribution p(x|θ) is flexible enough to place all of its mass on the optimal state x* = argmax_x f(x). Under mild restrictions the bound is differentiable, see section(1.1), and thus provides a smooth alternative objective function (see also section(4.1) on the relation to 'smoothing' methods). The degree of smoothness (and the deviation from the original objective) increases as the dispersion of the variational distribution increases. In section(1.2) we give sufficient conditions for the variational bound to be concave. The purpose of this paper is to demonstrate the ease with which VO can be applied and to discuss its merits as a general way to construct a smooth alternative objective.

1.1 Differentiability of the variational objective

Even when f(x) is not differentiable, under weak conditions E(θ) can be made differentiable. The gradient of E(θ) is given by

    \frac{\partial E}{\partial \theta} = \frac{\partial}{\partial \theta} \int f(x)\, p(x|\theta)\, dx
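Since E(θ) = ∫ f(x) p(x|θ) dx, the gradient can, under standard regularity conditions, be rewritten as ⟨ f(x) ∂ log p(x|θ)/∂θ ⟩, an expectation that requires no gradient of f and can be estimated by Monte Carlo. Below is a minimal sketch of this idea in Python, assuming a Gaussian variational distribution p(x|θ) = N(x|μ, σ²) over a one-dimensional x; the objective f(x) = -|x - 3|, the step size, the sample size and the σ-shrinking schedule are illustrative choices rather than details taken from the paper.

```python
import numpy as np

# Non-differentiable objective to maximize: f(x) = -|x - 3|, with optimum x* = 3.
def f(x):
    return -np.abs(x - 3.0)

rng = np.random.default_rng(0)

mu, sigma = -5.0, 2.0        # theta = (mu, sigma) of the variational Gaussian p(x|theta)
lr, n_samples = 0.1, 500     # illustrative step size and Monte Carlo sample size

for step in range(300):
    x = rng.normal(mu, sigma, n_samples)
    fx = f(x)
    # Score-function (log-derivative) estimate of dE/dmu, using
    #   d log N(x|mu, sigma^2) / d mu = (x - mu) / sigma^2.
    # Subtracting the sample mean of f as a baseline reduces the variance of the estimate.
    grad_mu = np.mean((fx - fx.mean()) * (x - mu) / sigma**2)
    mu += lr * grad_mu
    sigma = max(0.97 * sigma, 1e-2)  # shrinking the dispersion tightens the bound

print(f"mu after optimization: {mu:.3f} (true optimum 3.0)")
```

Shrinking σ trades smoothness for fidelity to the original objective: a broad p(x|θ) gives a smooth but loose bound, while a narrow one concentrates E(θ) around f near the current mean.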

[1] Rich Caruana et al. Removing the Genetics from the Standard Genetic Algorithm. ICML, 1995.

[2] Corinna Cortes et al. Support-Vector Networks. Machine Learning, 1995.

[3] Arnaud Berny. Selection and Reinforcement Learning for Combinatorial Optimization. PPSN, 2000.

[4] C. Lemaréchal. Chapter VII: Nondifferentiable Optimization. 1989.

[5] David Barber et al. Concave Gaussian Variational Approximations for Inference in Large-Scale Bayesian Linear Models. AISTATS, 2011.

[7] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, 2014.

[8] A. Berny. Statistical Machine Learning and Combinatorial Optimization. 2001.

[9] Shuiwang Ji et al. SLEP: Sparse Learning with Efficient Projections. 2011.

[10] David Barber. Bayesian Reasoning and Machine Learning. 2012.

[11] Wenjiang J. Fu. Penalized Regressions: The Bridge versus the Lasso. 1998.

[12] J. A. Lozano et al. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. 2001.

[13] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. 1996.

[14] Chih-Jen Lin et al. Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 2005.

[15] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage. 1980.

[16] R. Tibshirani et al. Pathwise Coordinate Optimization. arXiv:0708.1485, 2007.

[17] Marcus Gallagher et al. Population-Based Continuous Optimization, Probabilistic Modelling and Mean Shift. Evolutionary Computation, 2005.

[18] Olivier Chapelle. Training a Support Vector Machine in the Primal. Neural Computation, 2007.