A Survey on Workload Classification and Job Scheduling Using Johnson's Algorithm in a Hadoop Environment

Big data deals with large datasets, focusing on storing, sharing and processing data. Organisations face difficulties in creating, manipulating and managing such datasets. For example, on a social media platform such as Facebook, each post accumulates likes, shares and comments every second, producing large datasets that are difficult to store and process. Big data involves massive volumes of both structured and unstructured data. A major problem in the big data community is workload classification and the scheduling of jobs across disks. Using MapReduce concepts, the computation time of each individual job on a machine can be identified, rather than only minimizing the overall computation time of the entire set of jobs. The MapReduce algorithm is first applied to split the larger input dataset into a reduced output dataset. MapReduce processes data in two phases: map and reduce. In the map phase, the given radar input dataset is split into individual key-value pairs, producing an intermediate output; in the reduce phase, those key-value pairs undergo shuffle and sort operations. Intermediate files created by map tasks are written to local disk, while output files are written to the Hadoop Distributed File System. The different types of jobs are assigned to different disks for scheduling. Johnson's algorithm is used to obtain the minimal (optimal) makespan for the set of jobs in the Hadoop environment. Job type and data locality are two important factors in the job scheduling process. The performance of individual disks is analysed with respect to the size of the dataset and the number of nodes in the cluster.
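The map, shuffle/sort and reduce phases described above can be illustrated with a minimal single-process sketch. This is not the Hadoop API; all function names (`map_phase`, `shuffle_sort`, `reduce_phase`) are illustrative, and the word-count mapper/reducer is a standard toy example, not the survey's radar workload.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Map phase: each input record is split into intermediate key-value pairs.
    pairs = []
    for rec in records:
        pairs.extend(mapper(rec))
    return pairs

def shuffle_sort(pairs):
    # Shuffle/sort: group intermediate values by key, ordered by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups, reducer):
    # Reduce phase: aggregate each key's grouped values into final output.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word-count example: two input lines stand in for input-split records.
records = ["hadoop map reduce", "map reduce"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)
out = reduce_phase(shuffle_sort(map_phase(records, mapper)), reducer)
# out == {"hadoop": 1, "map": 2, "reduce": 2}
```

In a real Hadoop cluster the intermediate `pairs` would be spilled to local disk by each map task and fetched over the network by reducers, while `out` would be written to HDFS.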
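Johnson's rule, mentioned above for obtaining a minimal makespan, applies to two-stage flow-shop scheduling: jobs whose first-stage time is smaller than their second-stage time run first in increasing first-stage order, and the rest run last in decreasing second-stage order. A minimal sketch follows, treating the map and reduce phases as the two stages; the job names and times are made up for illustration.

```python
def johnson_order(jobs):
    """Order (name, stage1_time, stage2_time) jobs by Johnson's rule."""
    # Jobs faster on stage 1 go first, sorted by ascending stage-1 time.
    front = sorted((j for j in jobs if j[1] < j[2]), key=lambda j: j[1])
    # Remaining jobs go last, sorted by descending stage-2 time.
    back = sorted((j for j in jobs if j[1] >= j[2]), key=lambda j: -j[2])
    return front + back

def makespan(order):
    """Total completion time when stage 2 of a job starts only after
    both its own stage 1 and the previous job's stage 2 finish."""
    t1 = t2 = 0
    for _, a, b in order:
        t1 += a                # stage 1 finishes this job at time t1
        t2 = max(t2, t1) + b   # stage 2 starts when both are ready
    return t2

# Hypothetical jobs as (name, map_time, reduce_time):
jobs = [("J1", 3, 2), ("J2", 1, 4), ("J3", 5, 1)]
order = johnson_order(jobs)
# order: J2 first (map-dominated), then J1, J3 (reduce-dominated, by
# decreasing reduce time); makespan(order) == 10
```

This two-stage model is the reason Johnson's algorithm maps naturally onto MapReduce jobs, where every job has a map phase followed by a reduce phase.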
