Efficiently Scheduling Hadoop Cluster in Cloud Environment

Today, many real-time applications, such as bioinformatics and image processing, must process large amounts of unstructured data, which demands fast, memory-rich, and highly efficient computing resources. Cloud computing addresses this need and is now the most favored option for big-data analytics. Hadoop, a framework for manipulating unstructured data, is commonly used for this purpose. However, the nodes that form a Hadoop cluster are scheduled randomly in the Amazon cloud. Because huge amounts of data must be transferred among these nodes, the time taken to upload and process the data is high, which degrades performance. Service providers, for their part, also aim to maximize resource utilization and minimize power consumption. This chapter aims to design an energy-efficient scheduler for a cloud environment that is suitable for big-data applications. The scheduler has been tested in an OpenStack cloud environment.