A Throughput Driven Task Scheduler for Improving MapReduce Performance in Job-Intensive Environments

MapReduce has proven to be a highly desirable platform for scalable parallel data analysis. Task scheduling in MapReduce is crucial to job execution and has a marked impact on system performance. To the best of our knowledge, previous scheduling algorithms rarely consider job-intensive environments and cannot provide high system throughput. This paper therefore proposes a novel scheduling technique for job-intensive environments to improve system throughput. First, through an in-depth analysis of job-intensive environments, we identify four major factors that affect system throughput. Second, based on these factors, we propose an efficient technique, called the throughput-driven task scheduler, which adopts a series of measures to improve the throughput of a MapReduce cluster. Finally, extensive simulation experiments show that the scheduler provides higher throughput than previous systems and can meet the requirements of practical job-intensive applications.