A Difference of Convex Functions Approach to Large-Scale Log-Linear Model Estimation

We introduce a new class of parameter estimation methods for log-linear models. Our approach relies on the fact that minimizing a rational function of mixtures of exponentials is equivalent to minimizing a difference of convex functions. This allows us to construct convex auxiliary functions by applying the concave-convex procedure (CCCP). We consider a modification of CCCP in which a proximal term is added (ProxCCCP), and extend it further by introducing an ℓ1 penalty. To solve the resulting "convex + ℓ1" auxiliary problem, we propose an approach called SeqGPSR that is based on sequential application of the GPSR procedure. We present a convergence analysis of the algorithms, including sufficient conditions for convergence to a critical point of the objective function. We propose an adaptive procedure for varying the strength of the proximal regularization term in each ProxCCCP iteration, and show that this procedure (AProxCCCP) is effective in practice and stable under some mild conditions. CCCP and the proposed variants are applied to the task of optimizing the cross-entropy objective function for an audio frame classification problem. Class posteriors are modeled using log-linear models with approximately 6 million parameters. Our results show that the CCCP variants achieve a much better cross-entropy objective value than direct optimization of the objective function by a first-order gradient-based approach, stochastic gradient descent, or the L-BFGS procedure.
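
To make the CCCP and ProxCCCP iterations concrete, the following is a minimal sketch on a toy scalar problem, not the log-linear objective or the implementation used in this work. It assumes the difference-of-convex split f(x) = u(x) - v(x) with u(x) = x^4 and v(x) = 2x^2; the function names, the proximal weight c, and the bisection solver are all illustrative assumptions. Each iteration linearizes v at the current point, adds the proximal term (c/2)(x - x_k)^2, and minimizes the resulting convex auxiliary function.

```python
def f(x):
    # Toy DC objective: u(x) - v(x) with u(x) = x**4 and v(x) = 2*x**2.
    return x**4 - 2.0 * x**2


def grad_v(x):
    # Gradient of the convex function v(x) = 2*x**2 that is subtracted.
    return 4.0 * x


def solve_auxiliary(x_k, c, lo=-10.0, hi=10.0, iters=200):
    # Minimize the convex auxiliary function
    #   u(x) - grad_v(x_k) * x + (c / 2) * (x - x_k)**2
    # by bisection on its derivative, which is strictly increasing in x.
    # The bracket [lo, hi] is an assumption that holds for |x_k| <= 10.
    def g(x):
        return 4.0 * x**3 - grad_v(x_k) + c * (x - x_k)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def prox_cccp(x0, c=1.0, steps=50):
    # c = 0 gives plain CCCP; c > 0 adds the proximal term.
    xs = [x0]
    for _ in range(steps):
        xs.append(solve_auxiliary(xs[-1], c))
    return xs


if __name__ == "__main__":
    iterates = prox_cccp(0.5, c=1.0)
    print(iterates[-1], f(iterates[-1]))
```

Started from x0 = 0.5, the iterates approach the critical point x = 1 of f, and the objective value does not increase from one iteration to the next. A larger c gives smaller, more conservative steps, which is the trade-off an adaptive choice of the proximal weight is meant to manage.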
