Surrogate maximization/minimization algorithms and extensions

Abstract. Surrogate maximization (or minimization) (SM) algorithms are a family of algorithms that can be regarded as a generalization of expectation-maximization (EM) algorithms. An SM algorithm aims to turn an otherwise intractable maximization problem into a tractable one by iterating two steps: the S-step computes a tractable surrogate function to substitute for the original objective function, and the M-step maximizes this surrogate function. Convexity plays a central role in the S-step. SM algorithms enjoy the same convergence properties as EM algorithms. There are mainly three approaches to constructing surrogate functions, namely, Jensen's inequality, first-order Taylor approximation, and the low quadratic bound principle. In this paper, we demonstrate the usefulness of SM algorithms by taking logistic regression models, AdaBoost and the log-linear model as examples. More specifically, by using different surrogate function construction methods, we devise several SM algorithms, including the standard SM, generalized SM, gradient SM, and quadratic SM algorithms, and two variants called the conditional surrogate maximization (CSM) and surrogate conditional maximization (SCM) algorithms.
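As an illustration of the quadratic-bound idea (not code from the paper itself), the sketch below applies a fixed-curvature quadratic surrogate to logistic regression: the Hessian of the negative log-likelihood, X^T diag(p(1-p)) X, is dominated by (1/4) X^T X, so a quadratic with that fixed curvature majorizes the objective and gives a closed-form M-step at every iteration. The function name quadratic_sm_logistic and the synthetic data setup are illustrative assumptions, not artifacts of the paper.

```python
import numpy as np

def quadratic_sm_logistic(X, y, n_iters=100):
    """Quadratic SM (lower-quadratic-bound) iterations for logistic regression.

    Maximizing the log-likelihood is done here as minimizing the negative
    log-likelihood f(b) = sum_i log(1 + exp(x_i^T b)) - y^T X b.  Its Hessian
    X^T diag(p(1-p)) X is bounded above by B = (1/4) X^T X, so the quadratic
    surrogate built from B majorizes f and each M-step has a closed form.
    """
    n, d = X.shape
    beta = np.zeros(d)
    # Fixed curvature of the surrogate; add a tiny ridge for numerical safety
    # and factor it once, outside the iteration loop.
    B = 0.25 * X.T @ X + 1e-8 * np.eye(d)
    B_inv = np.linalg.inv(B)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current predicted probabilities
        grad = X.T @ (p - y)                  # gradient of the negative log-likelihood
        # M-step: exact minimizer of the quadratic surrogate anchored at beta.
        beta = beta - B_inv @ grad
    return beta

if __name__ == "__main__":
    # Small synthetic check: recover a known coefficient vector (assumed example).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_beta = np.array([1.5, -2.0, 0.5])
    y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
    print(quadratic_sm_logistic(X, y))
```

Because the curvature matrix is fixed, it is factored only once; the trade-off against Newton's method is more iterations but a guaranteed monotone ascent of the likelihood, which is the defining property of SM/MM schemes.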
