Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

Hadoop MapReduce detects and recovers from faults reactively, after they occur, using static heartbeat detection and re-execution from scratch. However, these techniques incur excessive response time penalties and consume resources inefficiently during detection and recovery. Existing fault-tolerance solutions attempt to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at different infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main fault conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time taken to detect faults and the time taken to recover from them. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that recovering from a single fault incurs an average response time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.
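The quantities discussed above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the timeout value, and the timings are hypothetical, and the penalty is assumed to be the relative increase in job response time over a fault-free run. The sketch shows a static heartbeat check (a node is flagged only once its last heartbeat is older than a fixed timeout) and how detection and recovery time add up to the observed penalty.

```python
def detect_failed_nodes(last_heartbeat, now, timeout_s):
    """Static heartbeat detection: flag nodes silent for longer than a fixed timeout.

    last_heartbeat maps a node name to the time (in seconds) of its last heartbeat.
    The timeout is fixed in advance and does not adapt to the workload.
    """
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def tolerance_time(detection_s, recovery_s):
    """Assumed definition: total time spent detecting a fault and recovering from it."""
    return detection_s + recovery_s


def response_time_penalty(baseline_s, faulty_s):
    """Assumed definition: percentage increase in response time over a fault-free run."""
    return (faulty_s - baseline_s) / baseline_s * 100.0


# Hypothetical values, chosen only for illustration.
heartbeats = {"node-a": 100.0, "node-b": 40.0}
failed = detect_failed_nodes(heartbeats, now=105.0, timeout_s=60.0)

baseline = 100.0                                   # fault-free job time in seconds
overhead = tolerance_time(detection_s=30.0, recovery_s=20.0)
penalty = response_time_penalty(baseline, baseline + overhead)
```

With a static timeout, the detection term is bounded below by the timeout itself, so shortening recovery alone cannot remove the penalty; this is one way to read the coupling between the detection and recovery stages.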
