Improving reliability and performances in large scale distributed applications with erasure codes and replication

Replication of data blocks is one of the main technologies on which storage systems for cloud computing and big data applications are based. With heterogeneous nodes and a constantly changing topology, preserving the reliability of the data held in a large-scale distributed file system is an important research challenge. Common approaches are based on either data replication or erasure codes. Replication stores each data block several times on different nodes of the infrastructure; its drawback is a large storage overhead and suboptimal resource utilization. Erasure coding instead exploits maximum distance separable (MDS) codes, which minimize the information required to restore blocks after a node failure; its drawback is increased complexity and transfer time, because several blocks coming from different sources are required to reconstruct the lost information. In this paper we study, by means of discrete event simulation, the performance that can be obtained by combining both techniques, with the goal of minimizing overhead and increasing reliability while preserving performance. The analysis shows that a careful balance between replication and erasure codes significantly improves reliability and performance, and avoids the large overheads incurred when either replication or erasure coding is used in isolation. We evaluate the performance of mixed erasure coding/replication allocation schemes. We model architectures with massively distributed storage. We show the effects of the different parameters on the performance of the allocation technique.
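The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not the simulation model used in the paper: it assumes an MDS code that splits a block into k data fragments and stores n fragments in total, treats r-way replication as the special case n = r, k = 1, and assumes fragments fail independently with the same probability. The names `Scheme`, `replication`, `mds_code`, and `loss_probability` are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Scheme:
    """Illustrative redundancy scheme: n stored fragments, any k of which suffice."""
    name: str
    total_fragments: int  # n
    data_fragments: int   # k

    @property
    def storage_overhead(self) -> float:
        # Stored bytes per byte of user data.
        return self.total_fragments / self.data_fragments

    @property
    def tolerated_failures(self) -> int:
        # An MDS code survives the loss of any n - k fragments.
        return self.total_fragments - self.data_fragments

    @property
    def fragments_read_for_repair(self) -> int:
        # Rebuilding one lost fragment needs k surviving fragments
        # (a single copy in the replication case).
        return self.data_fragments

    def loss_probability(self, p: float) -> float:
        # Probability that more than n - k fragments are lost, assuming
        # independent failures with per-fragment probability p (an assumption).
        n, k = self.total_fragments, self.data_fragments
        return sum(
            comb(n, i) * p**i * (1 - p) ** (n - i)
            for i in range(n - k + 1, n + 1)
        )


def replication(copies: int) -> Scheme:
    # r-way replication viewed as an (r, 1) code.
    return Scheme(f"{copies}-way replication", copies, 1)


def mds_code(n: int, k: int) -> Scheme:
    return Scheme(f"MDS({n},{k})", n, k)


if __name__ == "__main__":
    for scheme in (replication(3), mds_code(9, 6)):
        print(
            scheme.name,
            "overhead:", scheme.storage_overhead,
            "tolerates:", scheme.tolerated_failures,
            "repair reads:", scheme.fragments_read_for_repair,
        )
```

Under these assumptions, 3-way replication stores three times the user data and repairs a lost copy from a single source, whereas an MDS(9,6) code stores 1.5 times the data but must fetch six fragments to repair one. This is the overhead versus transfer-time tension that motivates mixing the two techniques.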
