Sync-on-the-fly: A Parallel Framework for Gradient Descent Algorithms on Transient Resources

Many cloud service providers offer transient resources (i.e., spare servers) for a fraction of the cost of on-demand servers. Many big data analytics tasks composed of iterative computations are well suited to such transient resources. However, modern distributed data processing systems, such as MapReduce and Spark, provide little support for running iterative computations on transient resources: their fault-tolerance mechanisms typically lead to cascading re-computations when transient resources are revoked. To address this problem, we propose a distributed framework, called Sync-on-the-fly, that takes advantage of the fact that many machine learning algorithms do not require fixed synchronization barriers. Instead, synchronization barriers can be established at any time, for example immediately before workers running on transient servers are revoked. We adapt and implement widely used gradient-descent-based algorithms, such as Logistic Regression and Matrix Factorization, to illustrate Sync-on-the-fly's approach. Our evaluation shows that Sync-on-the-fly achieves up to a 5× speedup over Spark and reduces costs by 85%.
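To make the core idea concrete, the following is a minimal, single-machine sketch (not the paper's implementation; all class and variable names are illustrative). Each worker runs local SGD steps on its data shard, and instead of synchronizing at a fixed iteration boundary, a barrier is established on demand, for instance the moment a revocation warning arrives for a transient server, at which point each worker ships its current partial model to be merged.

```python
import threading
import time
import numpy as np

# Illustrative sketch of sync-on-the-fly for logistic regression.
# Workers iterate asynchronously; a barrier is raised on demand
# (e.g., on a revocation warning) rather than at a fixed iteration.
class Worker(threading.Thread):
    def __init__(self, X, y, w0, lr, sync_event, results, idx):
        super().__init__()
        self.X, self.y, self.lr = X, y, lr
        self.w = w0.copy()            # local replica of the global model
        self.sync_event = sync_event  # set by the master to force a barrier
        self.results, self.idx = results, idx

    def run(self):
        while not self.sync_event.is_set():
            # one logistic-regression gradient step on this worker's shard
            p = 1.0 / (1.0 + np.exp(-self.X @ self.w))
            self.w -= self.lr * self.X.T @ (p - self.y) / len(self.y)
        self.results[self.idx] = self.w  # ship partial result at the barrier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
shards = np.array_split(np.arange(1000), 4)

sync = threading.Event()
results = [None] * 4
workers = [Worker(X[s], y[s], np.zeros(5), 0.1, sync, results, i)
           for i, s in enumerate(shards)]
for wk in workers:
    wk.start()

# Later, e.g., when the provider warns that a transient server will be
# revoked, the master establishes a barrier immediately instead of
# waiting for a fixed one; no in-progress work is lost or re-computed.
time.sleep(0.1)
sync.set()
for wk in workers:
    wk.join()
w_global = np.mean(results, axis=0)  # merge partial models
print("merged model:", w_global)
```

In a real deployment the merge step would run on a reliable (on-demand) master node and the event would be triggered by the cloud provider's revocation notice; the sketch only shows how a barrier can be imposed at an arbitrary point without invalidating the work done so far.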