MOSM: An approach for efficient storing massive small files on Hadoop

Owing to its high scalability and reliability, Hadoop has become a popular big data processing platform. The Hadoop Distributed File System (HDFS), one of Hadoop's core components, stores large files efficiently. However, storing massive numbers of small files in HDFS causes the “small files problem”: NameNode memory becomes a bottleneck and access performance degrades. To address this, we propose a multilevel optimization storage method (MOSM) that optimizes the storage process for small files. A merging algorithm combines small files into large files to reduce NameNode memory usage. On top of the merged files, we design an efficient hybrid index strategy and a prefetching cache mechanism to speed up access to small files. Experimental results indicate that MOSM effectively reduces the load on the NameNode and improves the ability of a Hadoop cluster to store large numbers of small files.
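
The general idea of merging small files and locating them through an index can be illustrated with a minimal sketch. It is not the MOSM merging algorithm or its hybrid index, whose details the abstract does not give. It assumes the simplest possible scheme: small files are concatenated into one byte sequence, and an in-memory index maps each file name to its (offset, length) in the merged data. The function names `merge_small_files` and `read_small_file` are hypothetical and do not come from Hadoop or from the paper.

```python
import io
from typing import Dict, Tuple

# Hypothetical index type: file name -> (offset, length) inside the merged data.
Index = Dict[str, Tuple[int, int]]


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record where each one lives."""
    buffer = io.BytesIO()
    index: Index = {}
    for name, data in files.items():
        # The current write position is the offset of this file in the blob.
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def read_small_file(merged: bytes, index: Index, name: str) -> bytes:
    """Look up a small file in the index and slice it out of the merged blob."""
    offset, length = index[name]
    return merged[offset:offset + length]


if __name__ == "__main__":
    small_files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b""}
    merged, index = merge_small_files(small_files)
    for name in small_files:
        assert read_small_file(merged, index, name) == small_files[name]
    print(len(merged), index)
```

In this sketch the storage layer tracks one merged object instead of one object per small file, which is the effect that merging is meant to have on NameNode metadata. A real system would also need to persist the index and handle updates and deletions.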
