Selective block minimization for faster convergence of limited memory large-scale linear models

As the data sets used to build classifiers steadily grow, training a linear model efficiently with limited memory becomes essential. Several techniques address this problem by loading blocks of data from disk one at a time, but they usually need a considerable number of iterations to converge to a reasonable model. Even the best block minimization techniques [1] require many block loads because they treat all training examples uniformly. Since disk I/O is expensive, reducing the number of disk accesses can dramatically decrease training time. This paper introduces selective block minimization (SBM), a block minimization method that makes use of selective sampling. At each step, SBM updates the model using data consisting of two parts: (1) new data loaded from disk and (2) a set of informative samples already in memory from previous steps. We prove that, by updating the linear model in the dual form, the proposed method fully utilizes the data in memory and converges to a globally optimal solution on the entire data set. Experiments show that SBM dramatically reduces the number of blocks loaded from disk and consequently obtains an accurate and stable model quickly on both binary and multi-class classification.
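The update loop described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the model is a binary L1-loss (hinge) linear SVM solved by dual coordinate descent, in-memory index lists stand in for blocks on disk, the dual variables of all samples are kept in memory, and "informative" is taken to mean the samples with the smallest margin. The names `dot`, `dual_cd_pass`, `sbm_train`, `cache_size`, and `inner_passes` are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def dual_cd_pass(w, alpha, indices, X, y, C):
    """One dual coordinate descent pass (L1-loss SVM) over `indices`.

    Updates `w` and `alpha` in place and keeps w = sum_i alpha[i]*y[i]*X[i].
    """
    for i in indices:
        xi, yi = X[i], y[i]
        q = dot(xi, xi)
        if q == 0.0:
            continue  # an all-zero sample cannot change w
        grad = yi * dot(w, xi) - 1.0  # partial derivative of the dual objective
        new_alpha = min(max(alpha[i] - grad / q, 0.0), C)  # project onto [0, C]
        delta = new_alpha - alpha[i]
        if delta != 0.0:
            alpha[i] = new_alpha
            for k in range(len(w)):
                w[k] += delta * yi * xi[k]


def sbm_train(X, y, n_blocks=4, cache_size=8, C=1.0,
              outer_iters=5, inner_passes=3, seed=0):
    """Sketch of selective block minimization on a data set split into blocks."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Assumption: index lists stand in for blocks that would be read from disk.
    blocks = [list(range(b, n, n_blocks)) for b in range(n_blocks)]
    w = [0.0] * d
    alpha = [0.0] * n
    cache = []  # informative samples retained from previous steps
    for _ in range(outer_iters):
        for block in blocks:
            in_block = set(block)
            # Working set = newly "loaded" block + cached samples not already in it.
            working = block + [i for i in cache if i not in in_block]
            for _ in range(inner_passes):
                rng.shuffle(working)
                dual_cd_pass(w, alpha, working, X, y, C)
            # Assumed selection rule: keep the samples with the smallest margin.
            working.sort(key=lambda i: y[i] * dot(w, X[i]))
            cache = working[:cache_size]
    return w, alpha
```

Calling `sbm_train(X, y)` with `X` as a list of feature lists and `y` as a list of labels in {+1, -1} returns the weight vector and the dual variables. Because every update is made in the dual, a sample's `alpha` value persists after its block is dropped from memory, which is what lets cached samples and newly loaded samples be optimized together.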

[1] P. Tseng, et al. On the convergence of the coordinate descent method for convex differentiable minimization, 1992.

[2] Yoram Singer, et al. Using and combining predictors that specialize, 1997, STOC '97.

[3] Thorsten Joachims, et al. Making large scale SVM learning practical, 1998.

[4] Osamu Watanabe, et al. A Random Sampling Technique for Training Support Vector Machines, 2001, ALT.

[5] Daphne Koller, et al. Support Vector Machine Active Learning with Applications to Text Classification, 2000, J. Mach. Learn. Res.

[6] Dan Roth, et al. Constraint Classification for Multiclass Classification and Ranking, 2002, NIPS.

[7] Dan Roth, et al. Constraint Classification: A New Approach to Multiclass Classification, 2002, ALT.

[8] Jiawei Han, et al. Classifying large data sets using SVMs with hierarchical clusters, 2003, KDD '03.

[9] Yann LeCun, et al. Large Scale Online Learning, 2003, NIPS.

[10] Koby Crammer, et al. Online Passive-Aggressive Algorithms, 2003, J. Mach. Learn. Res.

[11] Brian Roark, et al. Incremental Parsing with the Perceptron Algorithm, 2004, ACL.

[12] Koby Crammer, et al. On the Learnability and Design of Output Codes for Multiclass Problems, 2002, Machine Learning.

[13] Antonio Artés-Rodríguez, et al. Double Chunking for Solving SVMs for Very Large Datasets, 2004.

[14] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[15] Jason Weston, et al. Fast Kernel Classifiers with Online and Active Learning, 2005, J. Mach. Learn. Res.

[16] L. Bottou, et al. Training Invariant Support Vector Machines using Selective Sampling, 2005.

[17] Thorsten Joachims, et al. Training linear SVMs in linear time, 2006, KDD '06.

[18] Katya Scheinberg, et al. An Efficient Implementation of an Active Set Method for SVMs, 2006, J. Mach. Learn. Res.

[19] Edward Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Computers, 2007, NIPS.

[20] Chih-Jen Lin, et al. A dual coordinate descent method for large-scale linear SVM, 2008, ICML '08.

[21] Chih-Jen Lin, et al. LIBLINEAR: A Library for Large Linear Classification, 2008, J. Mach. Learn. Res.

[22] Koby Crammer, et al. Confidence-weighted linear classification, 2008, ICML '08.

[23] Nathan Srebro, et al. SVM optimization: inverse dependence on training set size, 2008, ICML '08.

[24] Chih-Jen Lin, et al. A sequential dual method for large scale multi-class linear SVMs, 2008, KDD.

[25] Yoram Singer, et al. The Forgetron: A Kernel-Based Perceptron on a Budget, 2008, SIAM J. Comput.

[26] John Langford, et al. Sparse Online Learning via Truncated Gradient, 2008, NIPS.

[27] John Langford, et al. Slow Learners are Fast, 2009, NIPS.

[28] Jacek Gondzio, et al. Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training, 2009, J. Mach. Learn. Res.

[29] Suresh Venkatasubramanian, et al. Streamed Learning: One-Pass SVMs, 2009, IJCAI.

[30] Koby Crammer, et al. Multi-Class Confidence Weighted Algorithms, 2009, EMNLP.

[31] Dan Roth, et al. Generating Confusion Sets for Context-Sensitive Error Correction, 2010, EMNLP.

[32] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent, 2010, NIPS.

[33] Chih-Jen Lin, et al. Dual coordinate descent methods for logistic regression and maximum entropy models, 2011, Machine Learning.

[34] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.

[35] Chih-Jen Lin, et al. Large Linear Classification When Data Cannot Fit in Memory, 2011, TKDD.