Improving MapReduce fault tolerance in the cloud

MapReduce is used at Google, Yahoo, Facebook, and elsewhere, including for production jobs. However, according to a recent study, a single failure in a Hadoop job can increase completion time by 50%. Amazon Elastic MapReduce helps users run data-intensive tasks for their applications, and these applications may have strict fault-tolerance and/or tight SLA requirements. Providing MapReduce fault tolerance in the cloud is more challenging, however, because topology control and (data) rack locality are currently not possible there. In this paper, we investigate how redundant copies of tasks can be provisioned to improve MapReduce fault tolerance in the cloud while reducing latency.
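The idea of provisioning redundant task copies can be illustrated with a minimal sketch: launch several identical copies of a task and accept the first one to finish, masking both failures and stragglers at the cost of extra resources. This is purely illustrative (the function name and failure-injection scheme are invented here); in a real MapReduce deployment, backup/redundant tasks are scheduled by the framework, not by user code.

```python
import concurrent.futures

def run_with_redundancy(task, arg, copies=2):
    """Run `copies` identical copies of `task(arg, copy_id)` and return
    the result of the first copy to complete successfully.

    Copies that fail (raise) are ignored as long as at least one copy
    succeeds, which is the essence of redundancy-based fault tolerance.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=copies) as pool:
        # Each copy gets an id so a test can deterministically fail one copy.
        futures = [pool.submit(task, arg, i) for i in range(copies)]
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()  # first successful copy wins
            except Exception:
                continue  # this copy "failed"; wait for another
    raise RuntimeError("all redundant copies failed")

def flaky_double(x, copy_id):
    """Hypothetical task: copy 0 simulates a node failure, others succeed."""
    if copy_id == 0:
        raise RuntimeError("simulated node failure")
    return x * 2
```

With two copies, the simulated failure of copy 0 is masked: `run_with_redundancy(flaky_double, 21, copies=2)` still returns 42. The trade-off the paper studies is how many such copies to provision, since each extra copy lowers failure-induced latency but consumes additional cloud resources.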
