UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems

Distributed storage systems typically need data to be stored redundantly to guarantee data durability and reliability. While the conventional approach towards this objective is to store multiple replicas, today's unprecedented data growth rates encourage modern distributed storage systems to employ Erasure Coding (EC) techniques, which can achieve better storage efficiency. Various hardware-based EC schemes have been proposed in the community to leverage the advanced compute capabilities on modern data center and cloud environments. Currently, there is no unified and easy way for distributed storage systems to fully exploit multiple devices such as CPUs, GPUs, and network devices (i.e., multi-rail support) to perform EC operations in parallel; thus, leading to the under-utilization of the available compute power. In this paper, we first introduce an analytical model to analyze the design scope of efficient EC schemes in distributed storage systems. Guided by the performance model, we propose UMR-EC, a Unified and Multi-Rail Erasure Coding library that can fully exploit heterogeneous EC coders. Our proposed interface is complemented by asynchronous semantics with optimized metadata-free scheme and EC rate-aware task scheduling that can enable a highly-efficient I/O pipeline. To show the benefits and effectiveness of UMR-EC, we re-design HDFS 3.x write/read pipelines based on the guidelines observed in the proposed performance model. Our performance evaluations show that our proposed designs can outperform the write performance of replication schemes and the default HDFS EC coder by 3.7x - 6.1x and 2.4x - 3.3x, respectively, and can improve the performance of read with failure recoveries up to 5.1x compared with the default HDFS EC coder. Compared with the fastest available CPU coder (i.e., ISA-L), our proposed designs have an improvement of up to 66.0% and 19.4% for write and read with failure recoveries, respectively.

[1]  Jack J. Dongarra,et al.  Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems , 2012, ICS '12.

[2]  Kannan Ramchandran,et al.  A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers , 2014 .

[3]  Catherine D. Schuman,et al.  A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage , 2009, FAST.

[4]  Patrick P. C. Lee,et al.  Repair Pipelining for Erasure-Coded Storage , 2017, USENIX Annual Technical Conference.

[5]  J. Plank Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Storage Applications , 2005 .

[6]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.

[7]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[8]  Cheng Huang,et al.  Giza: Erasure Coding Objects across Global Data Centers , 2017, USENIX Annual Technical Conference.

[9]  Kannan Ramchandran,et al.  Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth , 2015, FAST.

[10]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[11]  Enhong Chen,et al.  Multi-Path Transport for RDMA in Datacenters , 2018, NSDI.

[12]  David E. Culler,et al.  SEDA: an architecture for well-conditioned, scalable internet services , 2001, SOSP.

[13]  Dimitris S. Papailiopoulos,et al.  XORing Elephants: Novel Erasure Codes for Big Data , 2013, Proc. VLDB Endow..

[14]  Kannan Ramchandran,et al.  A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster , 2013, HotStorage.

[15]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[16]  Dan Dobre,et al.  Hybris: Robust Hybrid Cloud Storage , 2014, SoCC.

[17]  Dhabaleswar K. Panda,et al.  High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).

[18]  Ethan L. Miller,et al.  Screaming fast Galois field arithmetic using intel SIMD instructions , 2013, FAST.

[19]  Rodrigo Rodrigues,et al.  High Availability in DHTs: Erasure Coding vs. Replication , 2005, IPTPS.

[20]  Dhabaleswar K. Panda,et al.  Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[21]  Jason Cong,et al.  Atlas: Baidu's key-value storage system for cloud data , 2015, 2015 31st Symposium on Mass Storage Systems and Technologies (MSST).

[22]  Alexandros G. Dimakis,et al.  Network Coding for Distributed Storage Systems , 2007, IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications.

[23]  Fei Meng,et al.  Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[24]  GhemawatSanjay,et al.  The Google file system , 2003 .

[25]  F. Moore,et al.  Polynomial Codes Over Certain Finite Fields , 2017 .

[26]  John Kubiatowicz,et al.  Erasure Coding Vs. Replication: A Quantitative Comparison , 2002, IPTPS.

[27]  Saurabh Bagchi,et al.  Partial-parallel-repair (PPR): a distributed technique for repairing erasure coded storage , 2016, EuroSys.

[28]  Dhabaleswar K. Panda,et al.  High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences , 2018, SoCC.

[29]  Mario Blaum,et al.  A Tale of Two Erasure Codes in HDFS , 2015, FAST.

[30]  Andrey Fedorov,et al.  Optimization of RAID Erasure Coding Algorithms for Intel Xeon Phi , 2016, 2016 IEEE International Conference on Networking, Architecture and Storage (NAS).

[31]  Anthony Skjellum,et al.  Gibraltar: A Reed‐Solomon coding library for storage applications on programmable graphics processors , 2011, Concurr. Comput. Pract. Exp..

[32]  Kannan Ramchandran,et al.  EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding , 2016, OSDI.

[33]  Cory Hill,et al.  f4: Facebook's Warm BLOB Storage System , 2014, OSDI.

[34]  Miguel Castro,et al.  Farsite: federated, available, and reliable storage for an incompletely trusted environment , 2002, OPSR.

[35]  Dhabaleswar K. Panda,et al.  A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters , 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).

[36]  Yang Tang,et al.  NCCloud: applying network coding for the storage repair in a cloud-of-clouds , 2012, FAST.

[37]  Sriram Rao,et al.  A The Quantcast File System , 2013, Proc. VLDB Endow..

[38]  Heng Zhang,et al.  Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication , 2016, FAST.