Optimal Tradeoff between Energy Consumption and Response Time in Large-Scale MapReduce Clusters

The rapid growth of digital databases has created a need for infrastructures, such as large-scale data centers and computational clusters, that can store and process very large volumes of data. To date, most clusters have been designed for performance. Because typical applications exhibit non-linear speed-ups, maximizing performance requires deciding how many nodes should process a given (intensive) task, rather than simply using the full cluster. In addition, energy consumption has recently attracted significant attention, given that the cost to operate a cluster may well exceed its acquisition cost. This issue also calls for judicious use of resources. The aim of this study is to present a method that achieves the optimal tradeoff between energy consumption and response time in distributed clusters, such as MapReduce clusters. To this end, we propose an algorithm that derives the fraction of the nodes that minimizes energy consumption without degrading performance (in terms of response time) by more than a user-defined threshold. Moreover, we present a generic and configurable framework that describes performance and energy consumption as functions of the number of nodes used; the framework accommodates widely used MapReduce-like parallel executions in a straightforward manner. The evaluation results show that our methodology can lead to significant energy savings with an acceptable performance penalty in many realistic situations.
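
The node-selection idea summarized above can be illustrated with a minimal Python sketch. It is not the algorithm of this study: the response-time model (a serial part, a parallel part divided by the number of nodes, and a per-node coordination overhead), the energy model (total power draw multiplied by response time), all parameter values, and the function names are hypothetical assumptions chosen only to show how a user-defined threshold constrains the search.

```python
def response_time(n, serial=10.0, parallel=1000.0, overhead=0.5):
    """Hypothetical response time on n nodes (assumed model, not from the paper).

    The overhead term grows with n, so the speed-up is non-linear and
    using every node is not necessarily the fastest choice.
    """
    return serial + parallel / n + overhead * n


def energy(n, total_nodes, p_active=200.0, p_idle=0.0):
    """Hypothetical energy: power of all nodes times response time.

    p_idle=0.0 assumes unused nodes are switched off (an assumption).
    """
    power = n * p_active + (total_nodes - n) * p_idle
    return power * response_time(n)


def best_node_count(total_nodes, threshold):
    """Pick the node count with the lowest energy among those whose
    response time is within (1 + threshold) of the best achievable one."""
    candidates = range(1, total_nodes + 1)
    t_best = min(response_time(n) for n in candidates)
    t_limit = (1.0 + threshold) * t_best
    feasible = [n for n in candidates if response_time(n) <= t_limit]
    return min(feasible, key=lambda n: energy(n, total_nodes))


if __name__ == "__main__":
    total = 100
    for threshold in (0.0, 0.2):
        n = best_node_count(total, threshold)
        print(f"threshold={threshold:.0%}: use {n}/{total} nodes, "
              f"T={response_time(n):.1f}, E={energy(n, total):.0f}")
```

Under these made-up parameters, a threshold of zero selects the performance-optimal node count, while a 20% threshold allows a smaller fraction of the cluster that consumes less energy. An exhaustive scan over node counts is used here only for clarity; it says nothing about how the proposed algorithm derives the fraction.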