Design implications for enterprise storage systems via multi-dimensional trace analysis

Enterprise storage systems face enormous challenges from the growth and increasing heterogeneity of the data they store. Designing future storage systems requires comprehensive insights that existing trace analysis methods are ill-equipped to supply. In this paper, we provide such insights with a new methodology that uses an objective, multi-dimensional statistical technique to extract data access patterns from network storage system traces. We apply our method to two large-scale, real-world production network storage system traces to obtain comprehensive access patterns and design insights at the user, application, file, and directory levels. From these we derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization strategies for servers, consolidation policies for clients, and improved caching performance for both.
