EC-Shuffle: Dynamic Erasure Coding Optimization for Efficient and Reliable Shuffle in Spark

Fault-tolerance capabilities attract increasing attention from existing data processing frameworks, such as Apache Spark. To avoid replaying costly distributed computation, like shuffle, local checkpoint and remote replication are two popular approaches. They incur significant runtime overhead, such as extra storage cost or network traffic. Erasure coding is another emerging technology, which also enables data resilience. It is perceived as capable of replacing the checkpoint and replication mechanisms for its high storage efficiency. However, it suffers heavy network traffic due to distributing data partitions to different locations. In this paper, we propose EC-Shuffle with two encoding schemes and optimize the shuffle-based operations in Spark or MapReduce-like frameworks. Specifically, our encoding schemes concentrate on optimizing the data traffic during the execution of shuffle operations. They only transfer the parity chunks generated via erasure coding, instead of a whole copy of all data chunks. EC-Shuffle also provides a strategy, which can dynamically select the per-shuffle biased encoding scheme according to the number of senders and receivers in each shuffle. Our analyses indicate that this dynamic encoding selection can minimize the total size of parity chunks. The extensive experimental results using BigDataBench with hundreds of mappers and reducers shows this optimization can reduce up to 50% network traffic and achieve up to 38% performance improvement.

[1]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[2]  Khushnood Abbas Movielens 20M Dataset , 2017 .

[3]  Mario Blaum,et al.  A Tale of Two Erasure Codes in HDFS , 2015, FAST.

[4]  Heng Zhang,et al.  Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication , 2016, FAST.

[5]  Yuqing Zhu,et al.  BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[6]  Dhabaleswar K. Panda,et al.  System-Level Scalable Checkpoint-Restart for Petascale Computing , 2016, 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS).

[7]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Cory Hill,et al.  f4: Facebook's Warm BLOB Storage System , 2014, OSDI.

[10]  Cheng Huang,et al.  Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads , 2012, FAST.

[11]  Garret Swart,et al.  Balancing reducer skew in MapReduce workloads using progressive sampling , 2012, SoCC '12.

[12]  Balaji Rajendran,et al.  A Survey of Storage Remote Replication Software , 2014, 2014 3rd International Conference on Eco-friendly Computing and Communication Systems.

[13]  John Kubiatowicz,et al.  Erasure Coding Vs. Replication: A Quantitative Comparison , 2002, IPTPS.

[14]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[15]  Dahlia Malkhi,et al.  CORFU: A distributed shared log , 2013, TOCS.

[16]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[17]  Haibo Chen,et al.  Replication-Based Fault-Tolerance for Large-Scale Graph Processing , 2018, IEEE Transactions on Parallel and Distributed Systems.

[18]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[19]  Lenin Ravindranath,et al.  Nectar: Automatic Management of Data and Computation in Datacenters , 2010, OSDI.

[20]  Lihao Xu,et al.  An efficient XOR-scheduling algorithm for erasure codes encoding , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[21]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[22]  Dimitris S. Papailiopoulos,et al.  XORing Elephants: Novel Erasure Codes for Big Data , 2013, Proc. VLDB Endow..

[23]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[24]  Xin Yao,et al.  SIMPO , 2018, ACM Trans. Archit. Code Optim..

[25]  Scott Shenker,et al.  Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks , 2014, SoCC.

[26]  Michael J. Freedman,et al.  Riffle: optimized shuffle service for large-scale data analytics , 2018, EuroSys.

[27]  Baochun Li,et al.  On Data Parallelism of Erasure Coding in Distributed Storage Systems , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).