FTSGD: An Adaptive Stochastic Gradient Descent Algorithm for Spark MLlib

The proliferation of massive datasets and the surge of interest in big data analytics have popularized a number of distributed data processing platforms such as Hadoop and Spark. Their large and growing ecosystems of libraries enable even novices to take advantage of the latest data analytics and machine learning algorithms. However, time-consuming data synchronization and communication in iterative algorithms on large-scale distributed platforms can cause significant performance inefficiency. MLlib is Spark's scalable machine learning library, and many of its algorithms employ Stochastic Gradient Descent (SGD) to find minima or maxima iteratively. Convergence, however, can be very slow if gradient data are synchronized on every iteration. In this work, we optimize the current SGD implementation in Spark MLlib by reusing each data partition multiple times within a single global iteration, finding better candidate weights more efficiently. Whether to run multiple local iterations within a partition is decided dynamically by the 68-95-99.7 rule. We also design a variant of the momentum algorithm to adapt the step size in every iteration: the step size is decreased whenever consecutive gradients differ significantly in direction. Experiments show that our adaptive algorithm is more efficient and can be up to 7 times faster than the original SGD implementation in MLlib.
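
The partition-reuse and adaptive step-size ideas can be illustrated with a minimal sketch on Spark's RDD API. This is not the authors' FTSGD implementation: names such as `runEpoch`, `adaptStep`, and `localIters`, the 0.5 shrink factor, and the use of a simple squared loss are all assumptions made for illustration only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object FTSGDSketch {

  // One "global" iteration: every partition refines the broadcast weights locally
  // for `localIters` passes over its own data, then the driver averages the
  // per-partition results. A squared loss is assumed purely for illustration.
  def runEpoch(sc: SparkContext,
               data: RDD[(Double, Array[Double])],   // (label, features)
               weights: Array[Double],
               stepSize: Double,
               localIters: Int): Array[Double] = {
    val bcW = sc.broadcast(weights)
    val dim = weights.length

    val (sumW, numParts) = data.mapPartitions { iter =>
      val points = iter.toArray            // materialize so the partition can be reused
      val w = bcW.value.clone()
      for (_ <- 0 until localIters; (y, x) <- points) {
        var pred = 0.0
        var i = 0
        while (i < dim) { pred += w(i) * x(i); i += 1 }
        val err = pred - y
        i = 0
        while (i < dim) { w(i) -= stepSize * err * x(i); i += 1 }
      }
      Iterator((w, 1))
    }.treeAggregate((new Array[Double](dim), 0))(
      seqOp  = { case ((acc, n), (w, m)) =>
        var i = 0; while (i < dim) { acc(i) += w(i); i += 1 }; (acc, n + m) },
      combOp = { case ((a, n), (b, m)) =>
        var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; (a, n + m) })

    sumW.map(_ / numParts)                 // average of the locally refined weights
  }

  // Sketch of the adaptive step-size rule: shrink the step size when two
  // consecutive aggregated updates point in noticeably different directions,
  // signalled here by a negative dot product. The 0.5 factor is an assumption.
  def adaptStep(stepSize: Double, prevDelta: Array[Double], delta: Array[Double]): Double = {
    val dot = prevDelta.zip(delta).map { case (a, b) => a * b }.sum
    if (dot < 0) stepSize * 0.5 else stepSize
  }
}
```

A full driver loop would feed the averaged weights back into the next epoch, pass consecutive aggregated updates through `adaptStep`, and use the 68-95-99.7 rule to decide whether further local iterations are worthwhile; how exactly the paper applies that rule is not reproduced here.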
