Large-scale logistic regression and linear support vector machines using Spark

Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Because the MapReduce framework is inefficient for iterative algorithms, Spark, an in-memory cluster-computing platform, was recently proposed and has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect the running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.
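
To make the idea of a distributed Newton method concrete, the following is a minimal sketch, not the released tool. It assumes L2-regularized logistic regression with labels in {-1, +1}, uses a plain Newton-CG iteration with backtracking line search (which may differ from the variant implemented in this work), and simulates data partitions with a Python list of NumPy arrays instead of Spark RDDs. All function names and the parameter `C` are illustrative assumptions. The point it shows is that the loss, the gradient, and each Hessian-vector product are sums of per-partition terms, so every worker only touches its own data.

```python
import numpy as np
from functools import reduce

def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# "Map" side: each function sees only one partition (X, y).
def local_loss(part, w):
    X, y = part
    return np.logaddexp(0.0, -y * (X @ w)).sum()

def local_grad(part, w):
    X, y = part
    return X.T @ ((sigmoid(y * (X @ w)) - 1.0) * y)

def local_hess_vec(part, w, v):
    X, y = part
    s = sigmoid(y * (X @ w))
    return X.T @ (s * (1.0 - s) * (X @ v))

# "Reduce" side: sum the per-partition results (stand-in for a cluster reduce).
def tree_sum(items):
    return reduce(np.add, items)

def objective(parts, w, C):
    return 0.5 * (w @ w) + C * tree_sum([local_loss(p, w) for p in parts])

def full_grad(parts, w, C):
    return w + C * tree_sum([local_grad(p, w) for p in parts])

def newton_cg(parts, dim, C=1.0, newton_iters=50, cg_iters=50,
              tol=1e-6, cg_tol=1e-3):
    w = np.zeros(dim)
    for _ in range(newton_iters):
        g = full_grad(parts, w, C)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        # Conjugate gradient for H s = -g; H is only used via products H d.
        s = np.zeros(dim)
        r = -g
        d = r.copy()
        rr = r @ r
        for _ in range(cg_iters):
            Hd = d + C * tree_sum([local_hess_vec(p, w, d) for p in parts])
            alpha = rr / (d @ Hd)
            s = s + alpha * d
            r = r - alpha * Hd
            rr_new = r @ r
            if np.sqrt(rr_new) <= cg_tol * gnorm:
                break
            d = r + (rr_new / rr) * d
            rr = rr_new
        # Backtracking (Armijo) line search along the Newton direction.
        f0 = objective(parts, w, C)
        gs = g @ s
        t = 1.0
        while t > 1e-10 and objective(parts, w + t * s, C) > f0 + 1e-4 * t * gs:
            t *= 0.5
        w = w + t * s
    return w
```

In this sketch the Hessian is never formed explicitly: each conjugate gradient step needs one pass over the partitions, which is the main reason such a method benefits from keeping the data in memory across iterations.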
