An Effective Strategy for Improving Small File Problem in Distributed File System

Distributed file systems such as HDFS and DFS are adopted to support cloud storage and are designed to optimize access to large files. The problem of massive numbers of small files, however, has been neglected and seriously restricts the performance of distributed file systems. To mitigate, and ideally solve, the small file problem, this paper defines the user access task. The correlations among access tasks, applications, and accessed files are constructed with an improved PLSA, which shifts the object of study from the file level to the task level. On this basis, an effective strategy is proposed to improve the small file problem in distributed file systems. The strategy merges small files in terms of access tasks and selects prefetching targets based on the transition probability between tasks. Finally, a system efficiency analysis model is established, and experimental results, compared with original HDFS, HAR, and the schemes of Dong, demonstrate that the proposed strategy effectively reduces the metadata server (MDS) workload and the request response delay.
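The two steps of the strategy, merging small files by access task and choosing prefetching targets from task transition probabilities, can be illustrated with a minimal sketch. It is not the paper's method: it assumes the mapping from tasks to files is already known (the paper derives it with an improved PLSA) and estimates transition probabilities from simple first-order counts over a task log. The function names, the sample data, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def merge_by_task(task_files):
    """Group the small files of each task into one merged unit (a sorted list here)."""
    return {task: sorted(files) for task, files in task_files.items()}

def transition_probabilities(task_log):
    """Estimate P(next task | current task) from consecutive pairs in a task log.

    Assumption: a first-order count-based estimate, not the paper's improved PLSA.
    """
    counts = defaultdict(Counter)
    for current, following in zip(task_log, task_log[1:]):
        counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {task: n / total for task, n in followers.items()}
    return probs

def prefetch_targets(probs, current_task, threshold=0.5):
    """Return tasks whose transition probability from current_task meets the threshold."""
    followers = probs.get(current_task, {})
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [task for task, p in ranked if p >= threshold]

# Hypothetical sample data: three tasks and the small files each one touches.
task_files = {
    "A": ["a2.png", "a1.png"],
    "B": ["b1.txt"],
    "C": ["c1.log"],
}
task_log = ["A", "B", "A", "B", "A", "C"]

merged = merge_by_task(task_files)
probs = transition_probabilities(task_log)
targets = prefetch_targets(probs, "A")
files_to_prefetch = [f for task in targets for f in merged[task]]
```

With this sample log, task B follows task A in two of three observed transitions, so only B's merged files are selected for prefetching when A is accessed. A real system would tune the threshold against cache capacity and the cost of a wrong prefetch.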
