Regularization Techniques for Learning with Matrices

There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. Doing so makes it easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already proven to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning.
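For reference, the duality invoked in the abstract can be stated in the following standard form (the notation here, including the function f and modulus beta, is illustrative and not taken from the paper):

% Strong-convexity / strong-smoothness duality (standard statement; the
% function f, the modulus \beta > 0, and the norm \|\cdot\| are illustrative).
\[
  f \text{ is } \beta\text{-strongly convex w.r.t.\ } \|\cdot\|
  \;\Longleftrightarrow\;
  f^{\star} \text{ is } \tfrac{1}{\beta}\text{-strongly smooth w.r.t.\ } \|\cdot\|_{\star},
\]
where strong convexity and strong smoothness mean, respectively,
\[
  f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\beta}{2}\,\|y - x\|^{2},
  \qquad
  f^{\star}(y) \;\le\; f^{\star}(x) + \langle \nabla f^{\star}(x),\, y - x\rangle + \tfrac{1}{2\beta}\,\|y - x\|_{\star}^{2},
\]
and $f^{\star}(x) = \sup_{u}\,\bigl[\langle x, u\rangle - f(u)\bigr]$ denotes the Fenchel conjugate. In the matrix setting, one would typically instantiate $f$ as a squared matrix norm (for example a squared group norm or Schatten norm), and its strong-convexity modulus with respect to the chosen norm is what enters the resulting regret and generalization bounds.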
