A novel approach to improve the performance of Hadoop in handling of small files

Hadoop is an open-source Java framework for processing big data. It has two core components: HDFS (Hadoop Distributed File System), which stores large amounts of data reliably, and MapReduce, a programming model that processes data in a parallel and distributed manner. Hadoop is designed to handle very large files and therefore suffers a performance penalty when dealing with a large number of small files: they place a heavy burden on the NameNode of HDFS and increase the execution time of MapReduce jobs. This work introduces HDFS and the small-file problem, reviews existing ways of dealing with it, and proposes an approach for handling small files. In the proposed approach, small files are merged using the MapReduce programming model on Hadoop. The approach improves Hadoop's performance in handling small files by ignoring files whose size is larger than the Hadoop block size, and it also reduces the memory required by the NameNode to store them.
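
The merging idea can be illustrated with a minimal sketch. This is not the paper's MapReduce implementation, only a plain-Python model of the selection step under stated assumptions: files larger than the block size are skipped, and the remaining small files are packed in order into merged files of at most one block each. The function names `plan_merge` and `estimate_namenode_bytes`, the 128 MB block size, and the figure of roughly 150 bytes of NameNode memory per file or block object are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size in bytes
BYTES_PER_OBJECT = 150           # assumed rough NameNode cost per file/block object


def plan_merge(files: List[Tuple[str, int]],
               block_size: int = BLOCK_SIZE) -> Tuple[List[List[str]], List[str]]:
    """Group small files into merged files no larger than one block.

    Returns (groups, skipped): `groups` lists the file names packed into
    each merged file, `skipped` lists files larger than the block size,
    which are left untouched.
    """
    groups: List[List[str]] = []
    skipped: List[str] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if size > block_size:
            skipped.append(name)          # large file: ignored by the merge
            continue
        if current and current_size + size > block_size:
            groups.append(current)        # current merged file is full
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups, skipped


def estimate_namenode_bytes(n_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode memory: one object per file plus one per block."""
    return n_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


if __name__ == "__main__":
    sample = [("a", 30), ("b", 40), ("c", 150), ("d", 50), ("e", 20)]
    groups, skipped = plan_merge(sample, block_size=100)
    print(groups, skipped)
```

With a toy block size of 100 bytes, the four small files in the sample collapse into two merged files while the oversized file `c` is skipped, so the NameNode tracks two files and two blocks instead of four of each. The sketch does not model the index a real system would need to locate an original file inside its merged file.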
