FARMER: A novel approach to file access correlation mining and evaluation reference model

File semantics have proven effective in optimizing large-scale distributed file systems. Because the I/O interfaces between upper-layer applications and file systems are elaborate and rich, a file system can provide useful and insightful semantic information. Hence, file semantic mining has become an increasingly important practice in both the engineering and research communities. Unfortunately, exploiting file semantic knowledge is challenging because a variety of factors can affect the mining process. Worse, the intricate interdependencies among these factors make it difficult to fully exploit the potentially important correlations among different kinds of semantic knowledge. This article proposes a file access correlation mining and evaluation reference (FARMER) model, in which each file is treated as a vector in a multivariate vector space and each item within the vector corresponds to a separate factor of the given file. The selection of factors depends on the application; examples include file path, creator, and executing program. If a particular factor occurs in both files, its value is non-zero. The extent of an inter-file relationship can then be measured by the likeness of the factor values in the files' semantic vectors. Under this model, FARMER represents files as structured vectors of identifiers, so basic vector operations can be used to quantify the correlation between two file vectors. FARMER uses a linear regression model to estimate the strength of the relationship between file correlation and a set of influencing factors so that "bad knowledge" can be filtered out. To demonstrate the model, FARMER is incorporated into a real large-scale object-based storage system as a case study to dynamically infer file correlations. In addition, FARMER-enabled optimization services for a metadata prefetching algorithm and an object data layout algorithm are implemented. Experimental results show that the FARMER-enabled prefetching algorithm reduces metadata operation latency by approximately 30%–40% compared with a state-of-the-art metadata prefetching algorithm and a commonly used replacement policy.
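As a rough illustration of the vector-space idea described above, the following sketch represents each file as a set of (factor, value) identifiers and scores two files by cosine similarity. The factor names, the 0/1 weighting, and the choice of cosine similarity are assumptions made for this example; the abstract does not specify FARMER's actual vector operations or weights.

```python
import math


def to_vector(items, vocabulary):
    """Map a set of (factor, value) identifiers onto a 0/1 vector."""
    return [1.0 if item in items else 0.0 for item in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def file_correlation(file_a, file_b):
    """Score two files by the overlap of their factor values.

    Each file is a dict of hypothetical factors, e.g. path, creator, program.
    A component is non-zero only where a file carries that identifier, so the
    dot product counts identifiers present in both files.
    """
    items_a = set(file_a.items())
    items_b = set(file_b.items())
    vocabulary = sorted(items_a | items_b)
    return cosine(to_vector(items_a, vocabulary), to_vector(items_b, vocabulary))


# Illustrative data only; these files and values are invented for the example.
a = {"path": "/home/alice/proj", "creator": "alice", "program": "gcc"}
b = {"path": "/home/alice/proj", "creator": "alice", "program": "vim"}
c = {"path": "/tmp", "creator": "bob", "program": "tar"}

print(file_correlation(a, b))  # two of three factors shared, about 0.67
print(file_correlation(a, c))  # no factors shared, 0.0
```

In a fuller treatment, each factor could carry a weight, and a regression fitted on observed access correlations could be used to down-weight factors that contribute little; that step is not shown here.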


[30]  Hong Jiang,et al.  FARMER: A novel approach to file access correlation mining and evaluation reference model , 2008, HPDC '08.