Hadoop Perfect File: A fast access container for small files with direct in disc metadata access

Storing and processing massive numbers of small files is one of the major challenges for the Hadoop Distributed File System (HDFS). To provide fast data access, the NameNode (NN) in HDFS maintains the metadata of all files in its main memory. Hadoop performs well with a small number of large files, which require relatively little metadata in the NN's memory. With a large number of small files, however, Hadoop runs into problems such as NN memory overload caused by the huge metadata size of these files. We present a new type of archive file, Hadoop Perfect File (HPF), to solve HDFS's small files problem by merging small files into a large file on HDFS. Existing archive files offer limited functionality and perform poorly when accessing a file inside the merged file, because metadata lookup requires reading and processing the entire index file(s). In contrast, HPF can directly access the metadata of a particular file from its index file without having to process it entirely. The HPF index system uses two hash functions: files' metadata are distributed across index files by a dynamic hash function and, for each index file, we build an order-preserving perfect hash function that preserves the position of each file's metadata in the index file. As a result, an access reads only the part of the index file that contains the metadata of the requested file. HPF also supports appending files after its creation. Our experiments show that file access with HPF can be more than 40% faster than with the original HDFS. Without the caching effect, HPF's file access is around 179% faster than MapFile and 11294% faster than HAR file. With the caching effect, HPF is around 35% faster than MapFile and 105% faster than HAR file.
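The two-level lookup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes fixed-size index records, a fixed number of index files chosen by a plain modulo hash (a stand-in for the dynamic hash function), and a dictionary of positions standing in for a real order-preserving perfect hash function. In-memory buffers play the role of index files on HDFS.

```python
import io
import struct
import zlib

# Assumed record layout: (offset, length) of a small file inside the merged file.
RECORD = struct.Struct(">QQ")
# Assumed fixed bucket count; a dynamic hash function would grow this over time.
NUM_INDEX_FILES = 4


def bucket_of(name):
    """Stand-in for the dynamic hash: choose which index file holds the record."""
    return zlib.crc32(name.encode("utf-8")) % NUM_INDEX_FILES


def build_index(entries):
    """Build index files from a dict mapping file name -> (offset, length)."""
    buckets = [[] for _ in range(NUM_INDEX_FILES)]
    for name in sorted(entries):
        buckets[bucket_of(name)].append(name)

    index_files, position_maps = [], []
    for names in buckets:
        buf = io.BytesIO()
        for name in names:
            buf.write(RECORD.pack(*entries[name]))
        index_files.append(buf)
        # A dict stands in for the order-preserving perfect hash function:
        # it maps each name to the position of its record in this index file.
        position_maps.append({name: i for i, name in enumerate(names)})
    return index_files, position_maps


def lookup(name, index_files, position_maps):
    """Read exactly one record instead of scanning the whole index file."""
    b = bucket_of(name)
    position = position_maps[b][name]
    index_file = index_files[b]
    index_file.seek(position * RECORD.size)
    return RECORD.unpack(index_file.read(RECORD.size))
```

Calling `build_index({"a.txt": (0, 10), "b.txt": (10, 5)})` and then `lookup("b.txt", ...)` with the returned values seeks to a single record and returns its `(offset, length)` pair. In this sketch the cost of a lookup depends on the record size, not on the size of the index file, which is the property the abstract attributes to HPF.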
