SepStore: Data Storage Accelerator for Distributed File Systems by Separating Small Files from Large Files

Distributed file systems often rely on disk file systems to store data on disk. Disk file systems generally perform better on large files than on small files, because large files tend to exhibit sequential access patterns. This paper improves the performance of data servers in distributed file systems by improving their performance on small files. An LSM-tree-based key-value store holds the data of small files, which transforms random accesses into sequential ones and reduces the amount of disk file system metadata. The key-value store also serves as the index for accessing small files. Experimental results show that our method improves throughput by up to 78% and IOPS by 37%.
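
The separation idea can be illustrated with a minimal sketch. It is not the paper's implementation: an append-only log plus an in-memory dictionary stands in for the LSM-based key-value store, and the names `SeparatingStore` and `SMALL_FILE_THRESHOLD`, as well as the 64 KiB cutoff, are assumptions made for illustration. Small files are appended to one log (sequential writes, one disk file system inode for many files) and located through the index, while large files are written as ordinary files.

```python
import os
import tempfile

SMALL_FILE_THRESHOLD = 64 * 1024  # hypothetical cutoff; not taken from the paper


class SeparatingStore:
    """Toy data server: small files share one append-only log,
    large files are stored as individual disk files."""

    def __init__(self, root, threshold=SMALL_FILE_THRESHOLD):
        self.root = root
        self.threshold = threshold
        self.log_path = os.path.join(root, "small.log")
        # name -> (offset, length); a stand-in for the key-value index
        self.index = {}
        os.makedirs(os.path.join(root, "large"), exist_ok=True)
        open(self.log_path, "ab").close()  # make sure the log exists

    def _large_path(self, name):
        return os.path.join(self.root, "large", name)

    def put(self, name, data):
        if len(data) < self.threshold:
            # Small file: append to the shared log, remember where it landed.
            with open(self.log_path, "ab") as log:
                offset = log.seek(0, os.SEEK_END)
                log.write(data)
            self.index[name] = (offset, len(data))
        else:
            # Large file: leave it to the disk file system.
            self.index.pop(name, None)
            with open(self._large_path(name), "wb") as f:
                f.write(data)

    def get(self, name):
        if name in self.index:
            offset, length = self.index[name]
            with open(self.log_path, "rb") as log:
                log.seek(offset)
                return log.read(length)
        with open(self._large_path(name), "rb") as f:
            return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = SeparatingStore(root, threshold=16)
        store.put("note", b"tiny payload")
        store.put("blob", b"x" * 1024)
        print(store.get("note"), len(store.get("blob")))
```

A real LSM-based store would additionally persist the index, sort and compact the data in the background, and reclaim the space of overwritten entries; the sketch omits all of that.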
