Improving Hadoop Performance in Handling Small Files

Hadoop, created by Doug Cutting, is a top-level Apache project that supports distributed applications involving thousands of nodes and huge amounts of data. It is a software framework released under a free license and inspired by Google's MapReduce and Google File System papers, developed in Java by a global community of contributors. Hadoop is used worldwide by organizations for research as well as production. Hadoop includes Hadoop Common, the Hadoop Distributed File System (HDFS), and MapReduce as its subprojects. Hadoop Common consists of the common utilities that support the other Hadoop subprojects. HDFS is a distributed file system that contributes to the high performance of Hadoop by providing high-throughput access to application data; it also improves reliability through data replication and maintains data integrity. MapReduce is a software framework, based on the MapReduce programming model, for performing distributed computation over huge amounts of data on clusters. Although Hadoop is widely used, its full potential is not yet realized because of several issues, the small files problem being one of them. Hadoop Archives were introduced as a solution to the small files problem from Hadoop version 0.18.0 onwards. Sequence files are also used as an alternative solution. Both have their respective merits and demerits. We propose a solution that is expected to combine their merits while improving the performance of Hadoop.
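To illustrate the SequenceFile alternative mentioned above, the sketch below packs a directory of small files into a single SequenceFile, using each file name as the key and the raw file bytes as the value. This is only an illustrative sketch, not the proposed solution of this paper; the class name SmallFilePacker, the argument layout, and the Text/BytesWritable record format are our own assumptions.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack every file in a local directory into one SequenceFile,
// one record per small file (key = file name, value = file contents).
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        File inputDir = new File(args[0]);   // local directory of small files (assumed argument)
        Path output = new Path(args[1]);     // destination SequenceFile, e.g. on HDFS (assumed argument)

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (File f : inputDir.listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                // Append one record for this small file.
                writer.append(new Text(f.getName()),
                              new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

For comparison, the same files could be bundled with the Hadoop Archives tool (for example, hadoop archive -archiveName files.har -p /user/demo/input /user/demo/out, with paths chosen here only for illustration), which keeps the original files addressable through the har:// filesystem but, unlike a SequenceFile, does not compress them and requires extra index lookups on access.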
