Large-Scale Learning with Less RAM via Randomization

We reduce the memory footprint of popular large-scale online learning methods by projecting the weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to those of their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across the full range of memory-versus-accuracy tradeoffs.
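To make the two randomization ideas above concrete, here is a minimal Python sketch. It is not the paper's actual encoding: the grid resolution `eps`, the names `randomized_round` and `MorrisCounter`, and the per-coordinate learning-rate schedule are illustrative assumptions. The first function projects a weight onto a coarse grid with unbiased randomized rounding, so each stored value needs far fewer bits than a 32-bit float; the second is a Morris-style probabilistic counter of the kind that can drive per-coordinate learning rates.

```python
import math
import random

def randomized_round(w, eps=1.0 / (1 << 12)):
    """Project w onto the grid {k * eps : k integer} by randomized rounding.

    Rounds up with probability equal to the fractional part, so the result
    is unbiased: E[randomized_round(w)] == w. With a bounded weight range,
    the grid index fits in far fewer bits than a 32-bit float.
    """
    scaled = w / eps
    lo = math.floor(scaled)
    if random.random() < scaled - lo:  # round up w.p. frac(scaled)
        lo += 1
    return lo * eps

class MorrisCounter:
    """Approximate event counter stored in very few bits (Morris-style).

    Keeps only the exponent c and increments it with probability 2**-c;
    the estimate 2**c - 1 is an unbiased estimate of the true count.
    """
    def __init__(self):
        self.c = 0

    def increment(self):
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1

# Hypothetical usage: an online-gradient step whose weight is re-rounded
# after each update, with a per-coordinate learning rate driven by an
# approximate count of how often the coordinate has been updated
# (the 0.1 / sqrt(count) schedule here is an assumed example).
counter = MorrisCounter()
w = 0.0
for g in [0.4, -0.1, 0.25]:  # toy gradient stream for one coordinate
    counter.increment()
    eta = 0.1 / math.sqrt(counter.estimate())
    w = randomized_round(w - eta * g)
print(w)
```

Because the rounding is unbiased, the expected update matches the full-precision update; this is what allows regret guarantees close to the exact variants, at the cost of a small additional variance term.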
