Discovery of Application Workloads from Network File Traces

Understanding application I/O access patterns is useful in several situations. First, insight into what applications are doing with their data at a semantic level helps in designing efficient storage systems. Second, it helps create benchmarks that closely mimic realistic application behavior. Third, it enables autonomic systems, because the information obtained can be used to adapt the system in a closed loop. All of these use cases require the ability to extract the application-level semantics of I/O operations. Methods such as modifying application code to associate I/O operations with semantic tags are intrusive. Network file system traces, by contrast, are a well-known source of information that can be obtained non-intrusively and analyzed either online or offline. These traces are sequences of primitive file system operations and their parameters. Simple counting, statistical analysis, or deterministic search techniques are inadequate for discovering application-level semantics in the general case because of the inherent variation and noise in realistic traces. In this paper, we describe a trace analysis methodology based on Profile Hidden Markov Models. We show that the methodology has powerful discriminatory capabilities that enable it to recognize applications from the patterns in their traces and to mark out regions in a long trace that encapsulate sets of primitive operations representing higher-level application actions. It is robust enough to work around discrepancies between training and target traces, such as differences in length and interleaving with other operations. We demonstrate the feasibility of recognizing patterns from a small sample of the trace, which enables faster trace analysis. Preliminary experiments show that the method is capable of learning accurate profile models on live traces in an online setting.
We present a detailed evaluation of this methodology in a UNIX environment using NFS traces of selected commonly used applications, such as compilations, as well as industrial-strength benchmarks such as TPC-C and PostMark, and discuss its capabilities and limitations in the context of the use cases mentioned above.
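
To make the idea of scoring a trace against a profile concrete, the following is a minimal sketch of Viterbi log-odds scoring with a profile HMM that has match, insert, and delete states. It is not the implementation evaluated in the paper: the operation alphabet `OPS`, the `consensus` sequence, and all probabilities are illustrative assumptions, and the profile is built by hand rather than trained from traces. The same transition probabilities are used out of every state, which is a simplification.

```python
import math

# Hypothetical alphabet of NFS operations (an assumption for illustration).
OPS = ["lookup", "getattr", "access", "read", "write", "create", "commit"]


def profile_log_odds(consensus, trace, p_match=0.9,
                     t_match=0.8, t_insert=0.1, t_delete=0.1):
    """Viterbi log-odds of `trace` against a hand-built profile of `consensus`.

    Match state j emits consensus[j-1] with probability p_match and spreads the
    rest uniformly over the other operations. Insert states emit from the
    uniform background, so they add no emission log-odds. Delete states are
    silent. All probabilities here are assumed values, not trained ones.
    """
    k, n = len(consensus), len(trace)
    neg_inf = float("-inf")
    background = 1.0 / len(OPS)

    def match_log_odds(j, op):
        if op == consensus[j - 1]:
            p = p_match
        else:
            p = (1.0 - p_match) / (len(OPS) - 1)
        return math.log(p / background)

    log_m, log_i, log_d = math.log(t_match), math.log(t_insert), math.log(t_delete)

    # M[j][i], I[j][i], D[j][i]: best score of a path that is in the match,
    # insert, or delete state of profile position j after consuming i operations.
    M = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    I = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    D = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    M[0][0] = 0.0  # begin state

    for j in range(k + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:  # advance the profile and consume one operation
                prev = max(M[j - 1][i - 1], I[j - 1][i - 1], D[j - 1][i - 1])
                M[j][i] = prev + log_m + match_log_odds(j, trace[i - 1])
            if i > 0:            # consume an interleaved operation, stay at j
                prev = max(M[j][i - 1], I[j][i - 1], D[j][i - 1])
                I[j][i] = prev + log_i
            if j > 0:            # skip profile position j without consuming
                prev = max(M[j - 1][i], I[j - 1][i], D[j - 1][i])
                D[j][i] = prev + log_d

    return max(M[k][n], I[k][n], D[k][n])


# Hypothetical higher-level action expressed as a sequence of primitive operations.
consensus = ["lookup", "getattr", "read", "read", "write", "commit"]
# The same action with one interleaved operation from another activity.
noisy = ["lookup", "getattr", "read", "access", "read", "write", "commit"]
# An unrelated sequence of operations.
unrelated = ["create", "write", "write", "commit", "getattr", "lookup", "access"]

score_exact = profile_log_odds(consensus, consensus)
score_noisy = profile_log_odds(consensus, noisy)
score_unrelated = profile_log_odds(consensus, unrelated)
```

Under these assumed parameters, the trace with an interleaved operation still scores well above the background model because the extra operation is absorbed by an insert state, whereas the unrelated trace scores below it. This is the kind of tolerance to length differences and interleaving that a deterministic search would lack.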
