High-level causal request traces are of interest to developers of large concurrent and distributed applications. These traces show how a request is processed as it passes through several modules, which may be processes, threads, machines, or devices. They aid programmer understanding and are increasingly analyzed by tools that detect performance and correctness errors. Precise traces are more useful than statistical approaches because they can reveal anomalous behavior and support decisions at run time. Since these traces are difficult to obtain without application-specific instrumentation of each module of the system, much of the recent work that analyzes request traces is limited to applications for which source code and developer expertise are available. We present BorderPatrol, which obtains precise request traces through systems built from a variety of unmodified modules written in varied languages and with varying architectures. These include Apache, thttpd, PostgreSQL, TurboGears, BIND, and notably Zeus, a closed-source, event-driven HTTP/1.1 web server that uses helper processes. BorderPatrol obtains these traces using active observation, which slightly modifies the event stream observed by system modules to simplify precise observation. Protocol processors aid active observation by leveraging knowledge of standard protocols and interfaces between concurrent modules, avoiding the need for implementation-specific instrumentation. BorderPatrol obtains precise traces for black-box systems that cannot be traced by any other technique. Further, it does so with limited overhead on real systems (approximately 10-15%), making it a viable option for deployment on produc-