Optimization Scheme for Small Files Storage Based on Hadoop Distributed File System

Hadoop Distributed File System (HDFS) has become a representative cloud storage platform thanks to its reliable, scalable, and low-cost storage capability. However, HDFS performs poorly when storing and accessing a huge number of small files, because massive numbers of small files place a heavy burden on the HDFS NameNode. Moreover, HDFS provides neither an optimization for storing and accessing small files nor a prefetching mechanism to reduce I/O operations. This paper proposes an optimized scheme, Structured Index File Merging (SIFM), which uses two-level file indexes, structured metadata storage, and a prefetching and caching strategy to reduce I/O operations and improve access efficiency. Extensive experiments demonstrate that SIFM achieves better performance than native HDFS and HAR when storing and accessing a large number of small files on HDFS.
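The general idea behind merging small files with a two-level index and a prefetching cache can be illustrated with a toy, in-memory sketch. This is not the SIFM implementation, whose details the abstract does not give; the class `MergedStore`, its method names, and the `cache_size` and `prefetch` parameters are all hypothetical. The first index level maps a group (for example, a directory) to one merged blob, the second maps each file name to its offset and length inside that blob, and a cache miss loads the requested file plus the next few files in merge order.

```python
from collections import OrderedDict


class MergedStore:
    """Toy model of small-file merging (hypothetical; not the paper's SIFM)."""

    def __init__(self, cache_size=4, prefetch=2):
        self.blobs = {}    # level 1: group name -> merged bytes
        self.index = {}    # level 2: group name -> {file name: (offset, length)}
        self.order = {}    # group name -> file names in merge order
        self.cache = OrderedDict()  # LRU cache: (group, name) -> bytes
        self.cache_size = cache_size
        self.prefetch = prefetch
        self.blob_reads = 0  # counts simulated reads of a merged blob

    def put_group(self, group, files):
        """Merge {name: bytes} into one blob and record each file's offset."""
        blob = bytearray()
        entries = {}
        for name, data in files.items():
            entries[name] = (len(blob), len(data))
            blob.extend(data)
        self.blobs[group] = bytes(blob)
        self.index[group] = entries
        self.order[group] = list(files)

    def _slice(self, group, name):
        offset, length = self.index[group][name]
        return self.blobs[group][offset:offset + length]

    def _remember(self, key, data):
        """Insert into the LRU cache, evicting the least recently used entry."""
        self.cache[key] = data
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def get(self, group, name):
        key = (group, name)
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # Cache miss: one simulated read serves this file and its successors.
        self.blob_reads += 1
        names = self.order[group]
        start = names.index(name)
        for neighbour in names[start + 1:start + 1 + self.prefetch]:
            self._remember((group, neighbour), self._slice(group, neighbour))
        data = self._slice(group, name)
        self._remember(key, data)  # cached last so it is the most recent entry
        return data


store = MergedStore(cache_size=4, prefetch=2)
store.put_group("logs", {"a.txt": b"alpha", "b.txt": b"beta",
                         "c.txt": b"gamma", "d.txt": b"delta"})
first = store.get("logs", "a.txt")   # miss: reads the blob, prefetches b and c
second = store.get("logs", "b.txt")  # served from the cache
```

In this sketch, sequential access to files in the same group triggers far fewer blob reads than file requests, which is the kind of I/O reduction a prefetching and caching strategy aims for. A real deployment would keep the merged blobs and indexes on HDFS rather than in Python dictionaries.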
