A Technique for Improving the Processing Speed of Hadoop MapReduce through Shuffle-Phase Optimization on SSD-Based Systems

MapReduce is a programming model widely used for processing big data in cloud datacenters. It is composed of Map, Shuffle, and Reduce phases. Hadoop MapReduce is one of the most popular frameworks implementing the MapReduce model. During the Shuffle phase, Hadoop MapReduce performs an excessive number of disk I/O operations and transfers a large volume of data; this accounts for about 40% of the total data-processing time. To address these problems, we propose a new shuffle mechanism that exploits the characteristics of SSDs. The mechanism consists of (1) data-address-based sorting, (2) data-address-based merging, and (3) early data transmission before Map-phase completion. To demonstrate the effectiveness of our approach, we implemented the mechanism in Hadoop MapReduce 1.2.1. Our experiments show that the proposed mechanism reduces job completion time by up to 5% compared with legacy Hadoop MapReduce.
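To make the idea of data-address-based sorting concrete, the following is a minimal, self-contained sketch (not code from the proposed system or from Hadoop's actual `MapOutputBuffer`). It assumes each spilled map-output record is described by a key and the byte offset (address) at which it was written; because an SSD serves random reads at near-uniform latency, records can be ordered by their storage addresses alone, deferring key comparisons until merge time. The `Spill` type and all names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AddressSortSketch {
    // Hypothetical spill record: the key it carries and the SSD byte
    // offset (address) where its serialized bytes reside.
    record Spill(String key, long address) {}

    public static void main(String[] args) {
        List<Spill> spills = new ArrayList<>(List.of(
                new Spill("b", 4096L),
                new Spill("a", 512L),
                new Spill("c", 2048L)));

        // Sort by storage address only, instead of by key. On an SSD,
        // random reads are cheap, so key order can be recovered later
        // during the merge step without a costly key sort here.
        spills.sort(Comparator.comparingLong(Spill::address));

        for (Spill s : spills) {
            System.out.println(s.address() + ":" + s.key());
        }
    }
}
```

The same comparator-based approach would apply to the address-based merge step: runs are consumed in address order so the device sees mostly sequential access per run, while logical key order is reconstructed only where the Reduce phase requires it.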