Model-Driven Autoscaling for Hadoop Clusters

In this paper, we present the design and implementation of a model-driven autoscaling solution for Hadoop clusters. We first develop novel performance models for Hadoop workloads that relate job completion times to workload and system parameters such as input size and resource allocation. We then employ statistical techniques to tune the models for specific workloads, including Terasort and K-means. Finally, we use the tuned models to determine the resources required to complete Hadoop jobs within the user-specified response time SLA. We implement our solution on an OpenStack-based cloud cluster running Hadoop. Our experimental results across different workloads demonstrate the autoscaling capabilities of our solution and show that it enables significant resource savings without compromising performance.
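The workflow described above (fit a performance model, then invert it to size the cluster for an SLA) can be illustrated with a minimal sketch. The model form T = a + b * (input_size / nodes), the function names `fit_model` and `nodes_for_sla`, and the sample measurements are all assumptions made for illustration; the abstract does not specify the paper's actual models or tuning procedure.

```python
def fit_model(samples):
    """Ordinary least-squares fit of T = a + b * (d / n).

    samples: list of (input_gb, nodes, completion_seconds) tuples.
    The linear-in-(d/n) form is an illustrative assumption, not the
    paper's model.
    """
    xs = [d / n for d, n, _ in samples]
    ts = [t for _, _, t in samples]
    m = len(samples)
    mean_x = sum(xs) / m
    mean_t = sum(ts) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    b = sxt / sxx            # per-node cost per GB of input
    a = mean_t - b * mean_x  # fixed overhead independent of cluster size
    return a, b


def nodes_for_sla(a, b, input_gb, sla_seconds, max_nodes=64):
    """Return the smallest node count whose predicted completion time
    meets the SLA, or None if no count up to max_nodes suffices."""
    for n in range(1, max_nodes + 1):
        if a + b * input_gb / n <= sla_seconds:
            return n
    return None


# Hypothetical profiling runs: (input size in GB, nodes, seconds).
samples = [(100, 4, 810), (100, 10, 360), (200, 5, 1260), (50, 10, 210)]
a, b = fit_model(samples)
required = nodes_for_sla(a, b, input_gb=100, sla_seconds=400)
```

With these made-up measurements, the fitted overhead `a` sets a floor on achievable completion time, so an SLA tighter than `a` cannot be met by adding nodes and the function returns None.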
