SMURF: Efficient and Scalable Metadata Access for Distributed Applications from Edge to the Cloud

As big data processing and analysis have come to dominate the usage of distributed and cloud infrastructures, the demand for distributed metadata access and transfer has increased. In many application domains, the volume of data generated exceeds petabytes, while the corresponding metadata amounts to terabytes or even more. In this paper, we propose SMURF, a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and prefetching strategies to sidestep fetching latency, and achieves scalable, high-performance metadata fetch/prefetch services in the cloud. We also study the phenomenon of semantic locality in real trace logs, which is not well utilized in metadata access prediction. We implement a predictor based on this observation and compare it with three existing state-of-the-art prefetch schemes on Yahoo! Hadoop audit traces. By caching and prefetching metadata according to the observed access patterns, our continuum caching and prefetching mechanism greatly improves the local cache hit rate and reduces the average fetching latency. We replayed approximately 20 million metadata access operations from real audit traces; our system achieved 80% accuracy in prefetch prediction and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms.
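The interaction between caching and prefetching described above can be illustrated with a minimal sketch. It is not SMURF's predictor: it assumes, purely for illustration, that semantic locality can be approximated by namespace proximity, so a miss on one path triggers a prefetch of the metadata of its sibling paths in the same directory. The class `PrefetchingMetadataCache`, the in-memory `remote` dictionary standing in for a remote metadata server, and the sample paths are all hypothetical.

```python
from collections import OrderedDict
import posixpath


class PrefetchingMetadataCache:
    """LRU cache that prefetches metadata for siblings of each missed path."""

    def __init__(self, remote, capacity=128):
        self.remote = remote          # hypothetical store: path -> metadata dict
        self.capacity = capacity
        self.cache = OrderedDict()    # path -> metadata, least recently used first
        self.hits = 0
        self.misses = 0

    def _put(self, path, meta):
        self.cache[path] = meta
        self.cache.move_to_end(path)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def _siblings(self, path):
        # Assumption: paths sharing a parent directory are likely accessed together.
        parent = posixpath.dirname(path)
        return [p for p in self.remote
                if p != path and posixpath.dirname(p) == parent]

    def get(self, path):
        if path in self.cache:
            self.hits += 1
            self.cache.move_to_end(path)
            return self.cache[path]
        self.misses += 1
        meta = self.remote[path]      # stands in for a wide-area fetch
        # Prefetch siblings first so the demanded entry ends up most recent.
        for sib in self._siblings(path):
            if sib not in self.cache:
                self._put(sib, self.remote[sib])
        self._put(path, meta)
        return meta

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


remote = {
    "/logs/2015/a.log": {"size": 10},
    "/logs/2015/b.log": {"size": 20},
    "/logs/2015/c.log": {"size": 30},
    "/data/x.bin": {"size": 40},
}
cache = PrefetchingMetadataCache(remote, capacity=8)
for p in ["/logs/2015/a.log", "/logs/2015/b.log",
          "/logs/2015/c.log", "/data/x.bin"]:
    cache.get(p)
print(cache.hits, cache.misses, cache.hit_rate())
```

In this toy replay, the first access to `/logs/2015/` misses and pulls in its two siblings, so the next two accesses hit locally. A real predictor would have to rank candidates and bound prefetch traffic, which this sketch does not attempt.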
