A Cost-Effective Data Node Management Scheme for Hadoop Clusters in Cloud Environment

The MapReduce framework in Hadoop is used to analyze large data sets held in a distributed storage system. The scheduler assigns MapReduce jobs to task nodes, which carry out the map and reduce operations. Each node has slots (virtual cores), and each slot processes one map or reduce task at a time. Map tasks are executed separately, before the reduce tasks. Both the execution order of jobs and the slot configuration of the cluster significantly affect CPU performance. In this paper, we propose DataNode assignment techniques for resource allocation in Hadoop MapReduce jobs. We ran experiments on Amazon EC2 and on a physical machine to demonstrate that the proposed technique helps select an optimized set of nodes to serve as DataNodes in a Hadoop cluster. This significantly reduces node cost and improves job execution performance in the cluster.
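To illustrate why slot configuration matters, the following is a minimal sketch of a toy timing model, not the technique proposed in the paper. It assumes that all tasks in a phase take the same time, that the reduce phase starts only after every map task has finished, and that a fixed pool of slots is split statically between map and reduce work. The function names and the default task durations are hypothetical.

```python
import math


def phase_time(num_tasks, task_seconds, slots):
    """Duration of one phase when equal-length tasks run in waves over the slots."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    waves = math.ceil(num_tasks / slots)  # each wave fills every slot once
    return waves * task_seconds


def job_time(map_tasks, reduce_tasks, map_slots, reduce_slots,
             map_seconds=10.0, reduce_seconds=20.0):
    """Total job time under the assumption that reduce starts after all maps end."""
    return (phase_time(map_tasks, map_seconds, map_slots)
            + phase_time(reduce_tasks, reduce_seconds, reduce_slots))


def best_slot_split(total_slots, map_tasks, reduce_tasks):
    """Try every static map/reduce split of the slot pool and keep the fastest."""
    candidates = (
        (job_time(map_tasks, reduce_tasks, m, total_slots - m), m)
        for m in range(1, total_slots)
    )
    best_time, best_map_slots = min(candidates)
    return best_map_slots, total_slots - best_map_slots, best_time
```

In this model, a job with 40 map tasks and 8 reduce tasks on 8 slots takes 140 seconds with a 4/4 split but 150 seconds with a 6/2 split, which shows how the same hardware can yield different job times depending only on the slot configuration.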
