Autoscaling for Hadoop Clusters

Unforeseen events such as node failures and resource contention can severely degrade the performance of data processing frameworks such as Hadoop, especially in cloud environments where such incidents are common. Meeting service-level agreements (SLAs) in the presence of such events requires resizing infrastructure resources quickly and dynamically. Unfortunately, the distributed and stateful nature of data processing frameworks makes it challenging to scale the system accurately at run-time. In this paper, we present the design and implementation of a model-driven autoscaling solution for Hadoop clusters. We first develop novel gray-box performance models for Hadoop workloads that relate job execution times to resource allocation and workload parameters. We then employ these models to dynamically determine the resources required to complete Hadoop jobs within the user-specified SLA under various scenarios, including node failures and multi-job executions. Our experimental results on three different Hadoop cloud clusters and across different workloads demonstrate the efficacy of our models and highlight their autoscaling capabilities.
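
To make the model-driven idea concrete, the following minimal sketch shows how a performance model can be inverted to size a cluster for a deadline. It does not reproduce the paper's gray-box models: the runtime formula (a fixed serial overhead plus perfectly parallel per-gigabyte work) and all names and parameters (predicted_runtime, nodes_for_sla, serial_s, per_gb_s, max_nodes) are assumptions made purely for illustration.

```python
import math


def predicted_runtime(nodes, input_gb, serial_s, per_gb_s):
    """Hypothetical model: runtime = serial overhead + parallel work / nodes.

    This is an illustrative stand-in, not the paper's gray-box model.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return serial_s + per_gb_s * input_gb / nodes


def nodes_for_sla(input_gb, sla_s, serial_s, per_gb_s, max_nodes=1000):
    """Return the smallest node count predicted to meet the SLA, or None.

    None means the deadline is unreachable under this assumed model,
    either because the serial part alone exceeds it or because more
    than max_nodes nodes would be needed.
    """
    if sla_s <= serial_s:
        return None
    # Invert the model: serial_s + work / n <= sla_s  =>  n >= work / (sla_s - serial_s)
    needed = max(1, math.ceil(per_gb_s * input_gb / (sla_s - serial_s)))
    return needed if needed <= max_nodes else None
```

Under these assumptions, an autoscaler would re-run nodes_for_sla with the remaining input and remaining time whenever a node fails or a new job arrives, and resize the cluster to the returned count.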
