Sync-on-the-fly: A Parallel Framework for Gradient Descent Algorithms on Transient Resources

Many cloud service providers offer transient resources (i.e., spare servers) for a fraction of the cost of on-demand servers. Many big data analytics tasks composed of iterative computations are well suited to such transient resources. However, modern distributed data processing systems, such as MapReduce and Spark, provide little support for running iterative computations on transient resources: their fault-tolerance mechanisms typically lead to cascading re-computations when transient resources are revoked. To address this problem, we propose a distributed framework, called Sync-on-the-fly, that takes advantage of the fact that many machine learning algorithms do not require fixed synchronization barriers. Instead, synchronization barriers can be established at any time, for example immediately before workers running on transient servers are revoked. We adapt and implement widely used gradient-descent-based algorithms, such as Logistic Regression and Matrix Factorization, to illustrate Sync-on-the-fly's approach. Our evaluation shows that Sync-on-the-fly achieves up to a 5× speedup over Spark and reduces costs by 85%.
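To make the core idea concrete, the following is a minimal, single-machine sketch (not the paper's implementation; all class and variable names are illustrative). Each worker runs local SGD steps on its data shard, and instead of synchronizing at a fixed iteration boundary, a barrier is established on demand, for instance the moment a revocation warning arrives for a transient server, at which point each worker ships its current partial model to be merged.

```python
import threading
import time
import numpy as np

# Illustrative sketch of sync-on-the-fly for logistic regression.
# Workers iterate asynchronously; a barrier is raised on demand
# (e.g., on a revocation warning) rather than at a fixed iteration.
class Worker(threading.Thread):
    def __init__(self, X, y, w0, lr, sync_event, results, idx):
        super().__init__()
        self.X, self.y, self.lr = X, y, lr
        self.w = w0.copy()            # local replica of the global model
        self.sync_event = sync_event  # set by the master to force a barrier
        self.results, self.idx = results, idx

    def run(self):
        while not self.sync_event.is_set():
            # one logistic-regression gradient step on this worker's shard
            p = 1.0 / (1.0 + np.exp(-self.X @ self.w))
            self.w -= self.lr * self.X.T @ (p - self.y) / len(self.y)
        self.results[self.idx] = self.w  # ship partial result at the barrier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
shards = np.array_split(np.arange(1000), 4)

sync = threading.Event()
results = [None] * 4
workers = [Worker(X[s], y[s], np.zeros(5), 0.1, sync, results, i)
           for i, s in enumerate(shards)]
for wk in workers:
    wk.start()

# Later, e.g., when the provider warns that a transient server will be
# revoked, the master establishes a barrier immediately instead of
# waiting for a fixed one; no in-progress work is lost or re-computed.
time.sleep(0.1)
sync.set()
for wk in workers:
    wk.join()
w_global = np.mean(results, axis=0)  # merge partial models
print("merged model:", w_global)
```

In a real deployment the merge step would run on a reliable (on-demand) master node and the event would be triggered by the cloud provider's revocation notice; the sketch only shows how a barrier can be imposed at an arbitrary point without invalidating the work done so far.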