Optimal Capacity Allocation for Executing MapReduce Jobs in Cloud Systems

Analyzing large amounts of data is of paramount importance for many companies. Big data and business intelligence applications are facilitated by the MapReduce programming model, while at the infrastructure layer cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge: it must guarantee the performance of MapReduce jobs while minimizing cloud resource costs. The contribution of this paper is twofold: (i) we formulate a linear programming model that minimizes cloud resource costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees; (ii) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters. Our solutions are validated by a large set of experiments. We demonstrate that our method determines the globally optimal solution for systems including up to 1000 user classes in less than 0.5 seconds. Moreover, the execution times of MapReduce jobs are within 19% of our upper bounds on average.
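To make the trade-off between resource cost and rejection penalties concrete, the following is a minimal sketch of a toy allocation problem, not the paper's formulation. All names and symbols (`JobClass`, `slots_per_job`, `penalty`, `slot_cost`, `capacity`) are assumptions introduced for illustration, and the number of slots a job needs to meet its deadline is taken as given rather than derived from execution-time bounds. With a single shared capacity constraint, this toy linear program reduces to a continuous knapsack problem, so a greedy rule solves it exactly; a model with more constraints would need a general LP solver.

```python
from dataclasses import dataclass


@dataclass
class JobClass:
    name: str
    submitted: float      # H_i: jobs submitted (upper bound on admitted jobs)
    slots_per_job: float  # r_i: slots assumed sufficient to meet the deadline
    penalty: float        # p_i: penalty paid for each rejected job


def allocate(classes, slot_cost, capacity):
    """Toy LP (hypothetical, not the paper's model):

    minimize   slot_cost * sum(r_i * h_i) + sum(p_i * (H_i - h_i))
    subject to sum(r_i * h_i) <= capacity,  0 <= h_i <= H_i

    Admitting one job of class i saves p_i and costs slot_cost * r_i, so
    classes are served in decreasing order of penalty per slot (p_i / r_i)
    until capacity runs out or admission stops paying off.
    """
    admitted = {c.name: 0.0 for c in classes}
    remaining = capacity
    by_value = sorted(classes, key=lambda c: c.penalty / c.slots_per_job,
                      reverse=True)
    for c in by_value:
        if c.penalty / c.slots_per_job <= slot_cost or remaining <= 0:
            break  # rejecting is no more expensive than running the job
        h = min(c.submitted, remaining / c.slots_per_job)
        admitted[c.name] = h
        remaining -= h * c.slots_per_job
    total = sum(
        slot_cost * c.slots_per_job * admitted[c.name]
        + c.penalty * (c.submitted - admitted[c.name])
        for c in classes
    )
    return admitted, total


# Illustrative numbers only; they do not come from the paper's experiments.
classes = [
    JobClass("A", submitted=10.0, slots_per_job=4.0, penalty=20.0),
    JobClass("B", submitted=10.0, slots_per_job=2.0, penalty=6.0),
    JobClass("C", submitted=5.0, slots_per_job=5.0, penalty=5.0),
]
admitted, total = allocate(classes, slot_cost=2.0, capacity=50.0)
print(admitted, total)
```

In this example class A is admitted in full, class B is admitted only until the capacity is exhausted, and class C is rejected entirely because its penalty per slot is lower than the slot cost.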
