A Novel Framework to Optimize I/O Cost in MapReduce: An Index-Based Solution

Abstract The Hadoop MapReduce framework was among the first to offer distributed data processing with features such as scalability, fault tolerance, and a flexible programming model. It gained popularity with the influx of business intelligence and data analytics applications. High MapReduce performance in such applications requires selective data access and low response time. Although a variety of frameworks have been proposed in the state of the art to improve MapReduce performance by integrating DBMS-like features such as indexes, modelling the map and reduce phases, or using different data layouts, they require changes to the existing underlying storage system (HDFS). This paper proposes a novel framework that creates indexes based on HDFS splits. The framework allows MapReduce applications to access only the data relevant to a query, without any changes to existing data layouts or file organization in HDFS. Since the main factors affecting MapReduce application performance are I/O usage and CPU usage, and most Hadoop jobs are I/O bound, the key objective of this work is to optimize the I/O cost of MapReduce applications.