How to optimally allocate resources for coded distributed computing?

To execute cloud computing tasks over a data center hosting hundreds of thousands of server nodes, it is natural to distribute computations across the nodes to take advantage of parallel processing. However, as we allocate more computing resources and further distribute the computations, a larger amount of intermediate data must be moved between consecutive computation stages, so the communication load becomes the bottleneck. In this paper, we study optimal resource allocation in distributed computing, with the goal of minimizing the total execution time, accounting for the durations of both the computation and communication phases. In particular, we consider a general MapReduce-type framework and focus on the recently proposed Coded Distributed Computing (CDC) approach. For all values of the problem parameters, we characterize the optimal number of servers that should be used for computing, provide the optimal placements of the Map and Reduce tasks, and propose an optimal coded data-shuffling scheme. To prove the optimality of the proposed scheme, we first derive a matching information-theoretic converse on the execution time, and then show that, among all resource allocation schemes achieving the minimum execution time, the proposed scheme uses the least number of servers.
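
To make the computation-communication tradeoff concrete, the following minimal Python sketch sweeps the number of servers K and the computation load r (the number of servers at which each Map task is replicated). Only the CDC communication load L(r) = (1/r)(1 - r/K) is taken from the known tradeoff result; the linear cost model and the constants c_map and c_shuffle are illustrative assumptions, not the paper's exact formulation.

def communication_load(r, K):
    """Normalized shuffle load of Coded Distributed Computing (CDC)
    with K servers and computation load r (each file mapped at r
    servers): L(r) = (1/r) * (1 - r/K). Replication enables coded
    multicasts that serve r servers at once."""
    return (1.0 / r) * (1.0 - r / K)

def execution_time(K, r, N, c_map, c_shuffle):
    """Toy end-to-end time model (an illustrative assumption, not the
    paper's formulation): Map time grows with the r*N/K files each
    server processes; shuffle time grows with the CDC load."""
    t_map = c_map * r * N / K
    t_shuffle = c_shuffle * N * communication_load(r, K)
    return t_map + t_shuffle

def best_allocation(K_max, N, c_map, c_shuffle):
    """Exhaustively sweep the number of servers K and the computation
    load r (1 <= r <= K) to find the allocation that minimizes the
    modeled execution time."""
    best = None
    for K in range(1, K_max + 1):
        for r in range(1, K + 1):
            t = execution_time(K, r, N, c_map, c_shuffle)
            if best is None or t < best[0]:
                best = (t, K, r)
    return best

if __name__ == "__main__":
    t, K, r = best_allocation(K_max=50, N=600, c_map=1e-3, c_shuffle=5e-3)
    print(f"min time {t:.3f} with K={K} servers, computation load r={r}")

For a fixed computation load r, adding servers shortens the Map phase but drives the shuffle load toward 1/r, so the sweep typically lands on an interior optimum rather than simply using every available server; the paper's contribution is characterizing that optimum exactly.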
