An Efficient Alternating Newton Method for Learning Factorization Machines

Factorization machines (FMs) have emerged as a powerful model in many applications. In this work, we study the training of FMs with the logistic loss for binary classification, a nonlinear extension of the linear model with the logistic loss (i.e., logistic regression). Newton methods have been shown to be an effective approach for training large-scale logistic regression, but they are difficult to apply to FMs because of the nonconvexity of the optimization problem. We consider a modification of FMs that is multiblock convex and propose an alternating minimization algorithm based on Newton methods. We further introduce several novel optimization techniques to reduce the running time. Our experiments demonstrate that the proposed algorithm is more efficient than stochastic gradient methods and coordinate descent methods. We also investigate the parallelization of our method for acceleration in multithreading environments.
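For concreteness, the standard FM model and the regularized logistic-loss training problem can be written as follows. The second form is a plausible sketch of the multiblock-convex modification the abstract refers to: two latent matrices U and V replace the single latent matrix of the standard FM, so that the objective is convex in each block (w, U, or V) when the other blocks are fixed; the paper's exact formulation may differ in details such as the linear term or the index range.

    \hat{y}(\mathbf{x}) = w_0 + \sum_{j=1}^{n} w_j x_j + \sum_{j < j'} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle\, x_j x_{j'}    (standard FM)

    \hat{y}(\mathbf{x}) = w_0 + \sum_{j=1}^{n} w_j x_j + \sum_{j < j'} \mathbf{u}_j^{\top} \mathbf{v}_{j'}\, x_j x_{j'}    (assumed multiblock-convex variant)

    \min_{w,\,U,\,V} \; \frac{\lambda}{2}\left(\|w\|_2^2 + \|U\|_F^2 + \|V\|_F^2\right) + \sum_{i=1}^{l} \log\left(1 + e^{-y_i \hat{y}(\mathbf{x}_i)}\right), \quad y_i \in \{-1, +1\}

The following Python/NumPy sketch illustrates one alternating truncated-Newton cycle under these assumptions; it is an illustration, not the paper's implementation. Because the score is linear in each latent block once the others are fixed, each block subproblem is convex, and a conjugate-gradient (CG) inner solver can approximate the Newton direction from Hessian-vector products alone. For simplicity the sketch uses the full pairwise form x^T U V^T x (all index pairs) rather than j < j', omits the line search, and skips the shared-computation caching and parallelization that the paper's techniques target.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cg(hess_vec, G, iters=20, tol=1e-6):
        # Conjugate gradient for the Newton system H D = -G (truncated-Newton inner loop).
        D = np.zeros_like(G)
        r = -G                      # residual with initial D = 0
        p = r.copy()
        rs = np.sum(r * r)
        for _ in range(iters):
            Hp = hess_vec(p)
            alpha = rs / np.sum(p * Hp)
            D = D + alpha * p
            r = r - alpha * Hp
            rs_new = np.sum(r * r)
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return D

    def newton_step_U(X, y, w, U, V, lam, cg_iters=20):
        # One truncated-Newton step on block U with w and V fixed.
        # Score (sketch form): z_i = w^T x_i + (x_i^T U)(V^T x_i); linear in U,
        # so the logistic-loss subproblem is convex.
        P = X @ V                                    # (l, d); fixed during the U update
        z = X @ w + np.sum((X @ U) * P, axis=1)
        s = sigmoid(z)
        g = s - (y + 1.0) / 2.0                      # d loss / d z for labels y in {-1, +1}
        h = s * (1.0 - s)                            # d^2 loss / d z^2
        G = lam * U + X.T @ (g[:, None] * P)         # gradient w.r.t. U

        def hess_vec(S):                             # Hessian-vector product, never forms H
            t = np.sum((X @ S) * P, axis=1)
            return lam * S + X.T @ ((h * t)[:, None] * P)

        return U + cg(hess_vec, G, iters=cg_iters)   # a practical solver adds a line search

    # Alternating minimization: by symmetry of x^T U V^T x, the V update reuses
    # the same routine with the roles of U and V swapped; a standard Newton step
    # for logistic regression would update the linear block w.
    rng = np.random.default_rng(0)
    l, n, d = 200, 30, 4
    X = rng.standard_normal((l, n))
    y = rng.choice([-1.0, 1.0], size=l)
    w = np.zeros(n)
    U = 0.01 * rng.standard_normal((n, d))
    V = 0.01 * rng.standard_normal((n, d))
    for _ in range(5):
        U = newton_step_U(X, y, w, U, V, lam=1.0)
        V = newton_step_U(X, y, w, V, U, lam=1.0)

Because the score is linear in each latent block, every subproblem is effectively regularized logistic regression with transformed features, which is exactly the setting where truncated Newton methods are known to excel.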
