Distributed Machine Learning via Sufficient Factor Broadcasting

Matrix-parameterized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems involving millions of samples and tens of thousands of classes, their parameter matrix can grow extremely large, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two "sufficient factors" (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency without affecting computational correctness: communication costs are linear in the parameter matrix's dimensions rather than quadratic. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrix-parameterized ML models.
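To make the rank-1 property concrete, below is a minimal NumPy sketch of the SFB idea for multiclass logistic regression, one of the models named above. The per-sample softmax cross-entropy gradient with respect to the weight matrix W factorizes as an outer product u v^T, so workers can exchange the pair (u, v) of size K + D instead of the full K-by-D gradient. All names, shapes, and the step size here are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch of Sufficient Factor Broadcasting (SFB) for multiclass
# logistic regression. Sizes and learning rate are assumed for illustration.
import numpy as np

K, D = 10, 1000          # number of classes, feature dimension (assumed)
ETA = 0.1                # learning rate (assumed)

def sufficient_factors(W, x, y):
    """Return (u, v) such that the per-sample gradient of the softmax
    cross-entropy loss w.r.t. W equals the rank-1 matrix u v^T."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax(Wx), shape (K,)
    u = p.copy()
    u[y] -= 1.0                       # softmax(Wx) - onehot(y), shape (K,)
    v = x                             # input features, shape (D,)
    return u, v

def apply_sufficient_factors(W, factors, eta=ETA):
    """Reconstruct and apply the full updates locally from broadcast factors."""
    for u, v in factors:
        W -= eta * np.outer(u, v)     # rank-1 update: O(K*D) local compute,
                                      # but only O(K+D) values were communicated
    return W

# Toy usage: two "workers" each compute factors on their own sample and
# broadcast only (u, v); every worker then applies all received factors.
rng = np.random.default_rng(0)
W = np.zeros((K, D))
samples = [(rng.normal(size=D), rng.integers(K)) for _ in range(2)]
broadcast = [sufficient_factors(W, x, y) for x, y in samples]  # sent over the network
W = apply_sufficient_factors(W, broadcast)                     # done on each worker
```

In this sketch the communicated payload per sample is K + D numbers rather than K * D, which is the linear-versus-quadratic saving the abstract describes; each worker pays the cost of reconstructing the rank-1 updates locally.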
