Performance impact of JobTracker failure in Hadoop

In this paper, we analyze the performance impact of JobTracker failure in Hadoop. A JobTracker failure is a serious problem that affects overall job processing performance. We describe the causes of failure and how the system behaves when job processing fails in Hadoop. On the basis of this analysis, we build a job completion time model that reflects the effects of failure. Our model is based on a stochastic process with a node crash probability. Using this model, we simulate the performance impact with credible failure data from the computer failure data repository available from USENIX, which has been collected over the past 9 years. The results show that the performance impact is severe: the job completion time typically increases about four times and, in the worst case, up to 68 times. Copyright © 2014 John Wiley & Sons, Ltd.
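The abstract does not give the model itself, so the following is only a minimal sketch of how a completion time model with a crash probability might be simulated. It is not the authors' model. It assumes that JobTracker crashes arrive as a Poisson process (exponential times between crashes), that a crash discards all progress of the running job, and that the job restarts from scratch after a fixed recovery delay. The function names, parameters, and numeric values are hypothetical and are not taken from the paper or from the failure data repository.

```python
import random


def simulate_completion_time(work_hours, crash_rate, recovery_hours, rng):
    """Elapsed hours until one job finishes under restart-from-scratch.

    Assumptions (illustrative, not from the paper):
    - crash_rate is the JobTracker crash rate per hour (Poisson arrivals);
    - a crash loses all progress and costs recovery_hours before a retry.
    """
    elapsed = 0.0
    while True:
        if crash_rate <= 0:
            # No failures: the job runs to completion uninterrupted.
            return elapsed + work_hours
        time_to_crash = rng.expovariate(crash_rate)
        if time_to_crash >= work_hours:
            # The job finished before the next crash.
            return elapsed + work_hours
        # Crash mid-job: the attempt is wasted, then recovery, then retry.
        elapsed += time_to_crash + recovery_hours


def mean_slowdown(work_hours, crash_rate, recovery_hours, trials=10000, seed=0):
    """Average completion time divided by the failure-free time."""
    rng = random.Random(seed)
    total = sum(
        simulate_completion_time(work_hours, crash_rate, recovery_hours, rng)
        for _ in range(trials)
    )
    return total / trials / work_hours


if __name__ == "__main__":
    # Hypothetical inputs: a 10-hour job, 0.1 crashes per hour, 1-hour recovery.
    print(mean_slowdown(10.0, 0.1, 1.0))
```

Under these assumptions the expected completion time has the closed form $(e^{\lambda w} - 1)(1/\lambda + R)$ for crash rate $\lambda$, job length $w$, and recovery delay $R$, so the slowdown grows exponentially with the product of crash rate and job length. This is one way a single point of failure can multiply completion time several times over.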
