Using provenance to efficiently improve metadata searching performance in storage systems

In cloud storage systems, more than 50% of requests are metadata operations, so file system metadata search performance has become increasingly important to users. As storage systems grow rapidly in volume, traditional full-size index trees cannot offer high-performance metadata search because of the hierarchical indexing bottleneck. To alleviate long latency and improve the quality of service (QoS) of cloud storage services, we propose a novel provenance-based metadata search system called PROMES. Metadata search in PROMES is split into three phases: (i) leveraging a correlation-aware metadata index tree to identify several files as seeds, most of which satisfy the query; (ii) using the seeds to find the remaining query results via a relationship graph search; and (iii) refining and reranking the complete result set before sending the final results to users. PROMES offers high query accuracy and low latency because files are indexed tightly and at low cost in a relationship graph derived from provenance analysis. Compared with state-of-the-art metadata search schemes, PROMES demonstrates its efficiency and efficacy in terms of query accuracy and response latency.

We propose a high-performance, cost-effective, provenance-based metadata search system. The use of a relationship graph can reduce the overhead of metadata searching. The approach exploits the time consumption of constructing relationships. We leverage the files' weights to improve the accuracy of metadata search.
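The three phases can be illustrated with a minimal sketch. It is not the PROMES implementation: a plain dictionary stands in for the correlation-aware index tree, the relationship graph is a weighted adjacency map, and all names (`search`, `seed_index`, `graph`, `max_hops`), the sample files, and the edge weights are hypothetical. Scoring by the product of edge weights along the first path found is likewise an assumption made for illustration.

```python
from collections import deque


def search(query, metadata, seed_index, graph, max_hops=2):
    """Toy three-phase search: seeds -> graph expansion -> rerank."""

    def matches(name):
        # A file satisfies the query if every queried attribute is equal.
        return all(metadata[name].get(k) == v for k, v in query.items())

    # Phase (i): use the index (a dict stand-in for the index tree) to
    # fetch a few seed files, keeping only those that satisfy the query.
    first_term = next(iter(query.items()))
    seeds = [f for f in seed_index.get(first_term, []) if matches(f)]

    # Phase (ii): breadth-first walk over the relationship graph, starting
    # from the seeds. Each queue entry carries the product of the edge
    # weights on the path by which the file was first reached.
    score = {f: 1.0 for f in seeds}
    visited = set(seeds)
    queue = deque((f, 0, 1.0) for f in seeds)
    while queue:
        name, hops, strength = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor, weight in graph.get(name, {}).items():
            if neighbor in visited:
                continue
            visited.add(neighbor)
            queue.append((neighbor, hops + 1, strength * weight))
            if matches(neighbor):
                score[neighbor] = strength * weight

    # Phase (iii): rerank by descending score, breaking ties by name.
    return sorted(score, key=lambda f: (-score[f], f))


# Hypothetical sample data: the index knows only one of the three matches.
metadata = {
    "a.c": {"owner": "alice", "type": "src"},
    "a.o": {"owner": "alice", "type": "obj"},
    "b.c": {"owner": "alice", "type": "src"},
    "notes.txt": {"owner": "bob", "type": "txt"},
}
seed_index = {("owner", "alice"): ["a.c"]}
graph = {
    "a.c": {"a.o": 0.9, "notes.txt": 0.5},
    "a.o": {"b.c": 0.4},
    "notes.txt": {"b.c": 0.8},
}

results = search({"owner": "alice"}, metadata, seed_index, graph)
print(results)
```

In this toy data the index returns only `a.c`, and the other two matching files are reached through graph edges rather than through the index, which is the effect the seed-then-expand design aims for.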
