Large LDPC Codes for Big Data Storage

Current distributed storage systems rely mainly on data replication to ensure a certain level of data availability and reliability. In scenarios such as data archiving, however, replication is not cost-effective and does not provide a robust solution against data loss. A recent trend is to introduce erasure codes into distributed storage. Inspired by RAID systems, early attempts focused on Reed-Solomon (R-S) based solutions with small block sizes. This paper investigates in detail the repair traffic of Low-Density Parity-Check (LDPC) codes with relatively large block sizes. It is demonstrated that LDPC codes have unique advantages over R-S based solutions, including low repair traffic for multiple erasures and for parity erasures. The LDPC-based method is integrated with the Hadoop system under various configurations. Both theoretical analysis and simulations show that using large LDPC codes significantly improves reliability without increasing repair latency or network traffic, especially for multiple erasures. Simulations also show large improvements in repair latency compared with Reed-Solomon codes. The latency is further reduced through parallelism by engaging MapReduce processes from Hadoop.
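A minimal sketch of the intuition behind the low-repair-traffic claim (this is an illustration, not the paper's implementation): in a sparse parity-check code, each parity is the XOR of only a few data blocks, so repairing one erased block reads only the other blocks in a single check equation, whereas a Reed-Solomon repair must read k surviving blocks. The code layout and check structure below are hypothetical toy choices.

```python
import os

BLOCK = 16  # bytes per block (toy size for illustration)

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(BLOCK)
    for b in blocks:
        for i in range(BLOCK):
            out[i] ^= b[i]
    return bytes(out)

# Hypothetical sparse code: 6 data blocks, 3 parity blocks, each
# parity covering only 2 data blocks (a low-weight check equation).
checks = [(0, 1), (2, 3), (4, 5)]

data = [os.urandom(BLOCK) for _ in range(6)]
parity = [xor_blocks([data[i] for i in idx]) for idx in checks]

# Simulate erasure of data block 3 and repair it from its check
# equation: only the peer data block and one parity block are read.
erased = 3
peers = [data[i] for i in checks[1] if i != erased] + [parity[1]]
repaired = xor_blocks(peers)
assert repaired == data[erased]  # 2 blocks read, versus k = 6 for R-S
```

With larger, sparser parity-check matrices the same principle holds: repair traffic scales with the check-equation weight rather than with k, which is the advantage the paper exploits for multiple erasures.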
