Flight data recorder: monitoring persistent-state interactions to improve systems management

Mismanagement of the persistent state of a system (all the executable files, configuration settings and other data that govern how a system functions) causes reliability problems and security vulnerabilities, and drives up operation costs. Recent research traces persistent state interactions, that is, how state is read, modified, and so on, to help with troubleshooting, change management and malware mitigation. This work has been limited by the difficulty of collecting, storing, and analyzing the tens to hundreds of millions of daily events that occur on a single machine, let alone on the thousands or more machines found in many computing environments. We present the Flight Data Recorder (FDR), which enables always-on tracing, storage and analysis of persistent state interactions. FDR uses a domain-specific log format tailored to observed file system workloads and common systems management queries. Our lossless log format compresses logs to only 0.5 to 0.9 bytes per interaction. In this log format, 1000 machine-days of logs (over 25 billion events) can be analyzed in less than 30 minutes. We report on our deployment of FDR to 207 production machines at MSN, and show that a single centralized collection machine can potentially scale to collecting and analyzing the complete records of persistent state interactions from 4000+ machines. Furthermore, our tracing technology is shipping as part of the Windows Vista OS.
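
To put the quoted figures in perspective, the following is a rough back-of-envelope sketch rather than a result from the paper. It takes only the numbers stated in the abstract (25 billion events, 1000 machine-days, 0.5 to 0.9 bytes per interaction, 30 minutes) and derives the implied per-machine event rate, total log size, and analysis throughput. The function name `summarize` and the convention 1 GB = 10^9 bytes are assumptions made for illustration. Because the abstract says "over 25 billion" and "less than 30 minutes", the derived values are approximate lower bounds.

```python
# Back-of-envelope arithmetic on the figures quoted in the abstract.
# Inputs are taken from the abstract; the GB convention (10**9 bytes)
# and the helper name are assumptions for this sketch.

EVENTS = 25_000_000_000          # "over 25 billion events"
MACHINE_DAYS = 1000              # "1000 machine-days of logs"
BYTES_PER_EVENT = (0.5, 0.9)     # compressed size range per interaction
ANALYSIS_SECONDS = 30 * 60       # "less than 30 minutes"


def summarize():
    """Return (events per machine-day, (min GB, max GB), events per second)."""
    events_per_machine_day = EVENTS / MACHINE_DAYS
    log_size_gb = tuple(EVENTS * b / 1e9 for b in BYTES_PER_EVENT)
    events_per_second = EVENTS / ANALYSIS_SECONDS
    return events_per_machine_day, log_size_gb, events_per_second


if __name__ == "__main__":
    per_day, (gb_lo, gb_hi), per_sec = summarize()
    print(f"events per machine-day: {per_day:,.0f}")
    print(f"compressed log size:    {gb_lo:.1f} to {gb_hi:.1f} GB")
    print(f"analysis throughput:    at least {per_sec:,.0f} events/s")
```

Under these assumptions, the 1000 machine-days of logs occupy roughly 12.5 to 22.5 GB in compressed form, which is consistent with the claim that a single collection machine can hold and scan them.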
