On the Feasibility of Byzantine Fault-Tolerant MapReduce in Clouds-of-Clouds

MapReduce is a framework for processing large data sets that is widely used in cloud computing. MapReduce implementations such as Hadoop tolerate crashes and file corruption, but there is evidence that arbitrary faults do occur and can affect the correctness of job executions. Furthermore, many outages of individual clouds have been reported, raising concerns about depending on a single cloud. We present a MapReduce runtime that tolerates arbitrary faults and runs on a set of clouds at a reasonable cost in computation and execution time. The main challenge is to avoid sending over the internet the huge amount of data that would normally be exchanged between map and reduce tasks.
