A Novel Framework to Optimize I/O Cost in MapReduce: An Index-Based Solution

Abstract The Hadoop MapReduce framework was among the first to offer distributed data processing with features such as scalability, fault tolerance, and a flexible programming model. It gained popularity with the influx of business intelligence and data analytics applications. High MapReduce performance in such applications requires selective data access and low response time. Although a variety of frameworks have been proposed in the state of the art to improve MapReduce performance by integrating DBMS-like features such as indexes, modelling the map and reduce phases, or using different data layouts, they require changes to the existing underlying storage system (HDFS). This paper proposes a novel framework that creates indexes based on HDFS splits. The framework allows MapReduce applications to access only the data relevant to a query, without any changes to existing data layouts or file organization in HDFS. Since the main factors affecting MapReduce application performance are I/O usage and CPU usage, and most Hadoop jobs are I/O bound, the key objective of this work is to optimize the I/O cost of MapReduce applications.