Storing and Accessing Small Files Based on HDFS

The Hadoop Distributed File System (HDFS) has become a representative cloud storage platform, benefiting from its reliable, scalable, and low-cost storage capability. Unfortunately, HDFS does not perform well for huge numbers of small files, because massive numbers of small files impose a heavy burden on the NameNode of HDFS. This paper introduces an optimized scheme, structured index file merging (SIFM), which uses a two-level index file and structured metadata storage to reduce I/O operations and improve access efficiency. Extensive experiments demonstrate that, compared with native HDFS and Hadoop Archive (HAR), the proposed SIFM achieves better performance in storing and accessing huge numbers of small files on HDFS.
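To make the merging idea concrete, the sketch below shows one common way such a scheme can be realized on the HDFS API: many small files are concatenated into a single large HDFS file, and a companion index records each small file's (name, offset, length) so it can later be read back with a single seek. This is a minimal, assumption-laden illustration, not the paper's actual SIFM implementation; in particular, a flat single-level text index stands in for SIFM's two-level index and structured metadata storage, and all class, path, and format choices here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative merge of many small HDFS files into one large file plus an
 * index of (name, offset, length) entries. The index layout is an
 * assumption for illustration; SIFM's real two-level index and structured
 * metadata format are not reproduced here.
 */
public class SmallFileMerger {

    public static void merge(FileSystem fs, Path smallFileDir,
                             Path mergedFile, Path indexFile) throws IOException {
        Map<String, long[]> index = new LinkedHashMap<>();
        long offset = 0;

        // Concatenate every small file in the directory into one merged file,
        // remembering where each one starts and how long it is.
        try (FSDataOutputStream out = fs.create(mergedFile)) {
            for (FileStatus status : fs.listStatus(smallFileDir)) {
                if (!status.isFile()) continue;
                long length = status.getLen();
                try (InputStream in = fs.open(status.getPath())) {
                    byte[] buffer = new byte[64 * 1024];
                    int read;
                    while ((read = in.read(buffer)) > 0) {
                        out.write(buffer, 0, read);
                    }
                }
                index.put(status.getPath().getName(), new long[]{offset, length});
                offset += length;
            }
        }

        // Persist the index as simple text lines: name \t offset \t length.
        try (FSDataOutputStream out = fs.create(indexFile)) {
            for (Map.Entry<String, long[]> e : index.entrySet()) {
                out.writeBytes(e.getKey() + "\t" + e.getValue()[0]
                        + "\t" + e.getValue()[1] + "\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        merge(fs, new Path(args[0]), new Path(args[1]), new Path(args[2]));
    }
}
```

Under this layout, the NameNode tracks only two files (the merged file and its index) instead of one metadata entry per small file, which is the source of the memory savings; a read then looks up the offset and length in the index and performs a single positioned read on the merged file, avoiding a per-file NameNode round trip.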
