Metadata-Aware Small Files Storage Architecture on Hadoop

The zettabytes (trillions of gigabytes) of data produced globally each year have made distributed data storage a necessity, and research on and application of Hadoop, the most representative open-source distributed file system, is growing accordingly. However, Hadoop is not well suited to handling massive numbers of small files. This paper presents a metadata-aware storage architecture for massive small files that takes full advantage of file metadata: a merge module classifies small files and merges them into SequenceFiles, and an efficient indexing mechanism is introduced, alleviating the memory bottleneck of the NameNode. Experiments using MP3 files as an example show that the architecture achieves good results.
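
To make the merge-and-index idea concrete, the following is a minimal Java sketch using Hadoop's standard SequenceFile API. It is an illustration under stated assumptions, not the paper's implementation: the class name SmallFileMerger and the in-memory HashMap index are hypothetical, and the paper's classification step (grouping related files before merging) is omitted for brevity.

    // Hypothetical sketch: merge a set of small local files into one HDFS
    // SequenceFile (key = original file name, value = raw file bytes) and
    // build an index of file name -> byte offset for later random access.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileMerger {

        /** Merges the given local files into one SequenceFile on HDFS and
         *  returns an index of file name -> start offset in the container. */
        public static Map<String, Long> merge(String[] localFiles, String hdfsTarget)
                throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Map<String, Long> index = new HashMap<>();

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(hdfsTarget), Text.class, BytesWritable.class);
            try {
                for (String name : localFiles) {
                    byte[] bytes = Files.readAllBytes(Paths.get(name));
                    // Record where this small file starts so a reader can
                    // seek() straight to it later instead of scanning.
                    index.put(name, writer.getLength());
                    writer.append(new Text(name), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
            return index;
        }
    }

With such an index, retrieving one small file does not require scanning the whole container: a reader looks up the file name, calls SequenceFile.Reader.seek() with the stored offset (positions returned by Writer.getLength() are documented as valid seek targets), and reads a single record with next(). This is the general mechanism by which merging plus indexing relieves NameNode memory pressure while preserving per-file access.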
