CHEAPS2AGA: Bounding Space Usage in Variance-Reduced Stochastic Gradient Descent over Streaming Data and Its Asynchronous Parallel Variants

Stochastic Gradient Descent (SGD) is widely used to train machine learning models over large datasets, yet its slow convergence rate can be a bottleneck. Memory algorithms such as SAG and SAGA form a notable family of variance reduction techniques that accelerate the convergence of SGD. However, these algorithms must store a correction term for every training data point, and this unbounded space usage is impractical for modern large-scale applications, especially when data points arrive over time (referred to as streaming data in this paper). To overcome this weakness, this paper investigates methods that bound the space usage of state-of-the-art variance-reduced stochastic gradient descent over streaming data and presents CHEAPS2AGA. At each model update, the key idea of CHEAPS2AGA is to always reserve N random data points as samples while re-using information about past stochastic gradients across all observed data points within a bounded memory footprint. In addition, training an accurate model over streaming data requires the algorithm to be time-efficient. To accelerate the training phase, CHEAPS2AGA employs a lock-free data structure to insert new data points and remove unused ones in parallel, and updates the model parameters without any locking. We conduct comprehensive experiments comparing CHEAPS2AGA to prior algorithms suited for streaming data; the results demonstrate its practical competitiveness in terms of scalability and accuracy.
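To make the bounded-memory idea concrete, the sketch below combines reservoir sampling with a SAGA-style variance-reduced update, keeping at most N reserved data points and their stored gradients. This is a minimal illustration of the idea described in the abstract, not the authors' implementation; names such as ReservoirSAGA, grad_fn, and reservoir_size are assumptions introduced for the example, and the lock-free parallel machinery of CHEAPS2AGA is omitted.

import numpy as np

class ReservoirSAGA:
    """Minimal sketch: bounded-memory, SAGA-style variance reduction over a stream
    via reservoir sampling. Illustrative only; not the CHEAPS2AGA implementation."""

    def __init__(self, dim, reservoir_size, step_size):
        self.w = np.zeros(dim)              # model parameters
        self.N = reservoir_size             # at most N reserved points / stored gradients
        self.step_size = step_size
        self.points = []                    # reserved (x, y) pairs
        self.stored_grads = []              # stored gradient per reserved point
        self.grad_avg = np.zeros(dim)       # running average of stored gradients
        self.seen = 0                       # total points observed in the stream

    def observe(self, x, y, grad_fn):
        # Reservoir sampling: a new point replaces a reserved one with probability N / seen.
        self.seen += 1
        g = grad_fn(self.w, x, y)
        if len(self.points) < self.N:
            self.points.append((x, y))
            self.stored_grads.append(g)
            self.grad_avg += (g - self.grad_avg) / len(self.points)
        else:
            j = np.random.randint(self.seen)
            if j < self.N:                  # evict a uniformly chosen reserved point
                self.grad_avg += (g - self.stored_grads[j]) / self.N
                self.points[j] = (x, y)
                self.stored_grads[j] = g
        self._saga_step(grad_fn)

    def _saga_step(self, grad_fn):
        # SAGA-style update on one reserved point: step along g_new - g_old + grad_avg.
        i = np.random.randint(len(self.points))
        x, y = self.points[i]
        g_new = grad_fn(self.w, x, y)
        g_old = self.stored_grads[i]
        self.w -= self.step_size * (g_new - g_old + self.grad_avg)
        self.grad_avg += (g_new - g_old) / len(self.points)
        self.stored_grads[i] = g_new

# Example usage with a least-squares gradient (an assumed loss, for illustration only):
#   def grad_fn(w, x, y): return (w @ x - y) * x
#   model = ReservoirSAGA(dim=10, reservoir_size=100, step_size=0.01)
#   for x, y in stream: model.observe(x, y, grad_fn)

The asynchronous parallel variant described in the abstract would additionally run such updates from multiple threads over a lock-free reservoir, updating the shared parameters without locking in the spirit of Hogwild-style schemes; that machinery is omitted from this sketch.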
