Precise, Scalable, and Online Request Tracing for Multitier Services of Black Boxes

As more and more multitier services are built from commercial off-the-shelf components or heterogeneous middleware without available source code, both developers and administrators need a request tracing tool to (1) know exactly how a user request of interest travels through services of black boxes and (2) obtain macrolevel user request behaviors of services without manually analyzing massive logs. This need is heightened by IT system "agility," which requires the tracing tool to provide online performance data, since offline approaches cannot reflect system changes in real time. Moreover, considering the large scale of deployed services, a pragmatic tracing approach should be scalable in terms of the cost of collecting and analyzing logs. In this paper, we introduce a precise, scalable, and online request tracing tool for multitier services of black boxes. Our contributions are threefold. First, we propose a precise request tracing algorithm for multitier services of black boxes, which uses only application-independent knowledge. Second, we present a microlevel abstraction, the component activity graph, to represent the causal paths of each request. On the basis of this abstraction, we use dominated causal path patterns to represent repeatedly executed causal paths that account for significant fractions of requests, and we further present a derived performance metric of causal path patterns, latency percentages of components, to enable debugging performance-in-the-large. Third, we develop two mechanisms, tracing on demand and sampling, to significantly increase system scalability. We implement a prototype of the proposed system, called PreciseTracer, and release it as open source code. In comparison with WAP5, a black-box tracing approach, PreciseTracer achieves higher tracing accuracy and faster response time. Our experimental results also show that PreciseTracer has low overhead and still achieves high tracing accuracy even if an aggressive sampling policy is adopted, indicating that PreciseTracer is a promising tracing tool for large-scale production systems.
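To make the latency-percentage metric concrete, the following is a minimal sketch and not PreciseTracer's implementation. It assumes each causal path has already been reconstructed as an ordered list of (component, seconds) segments, and it treats a pattern simply as the ordered sequence of component names. The function names, the path representation, and the component names (httpd, jboss, mysql) are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def pattern_of(path):
    """Assumed definition: a path's pattern is its ordered component sequence."""
    return tuple(component for component, _ in path)


def latency_percentages(paths):
    """For each pattern, report the fraction of paths that follow it and
    each component's share (in percent) of the pattern's total latency."""
    counts = Counter(pattern_of(p) for p in paths)
    totals = defaultdict(lambda: defaultdict(float))
    for path in paths:
        for component, seconds in path:
            totals[pattern_of(path)][component] += seconds

    report = {}
    for pattern, count in counts.items():
        total = sum(totals[pattern].values())
        report[pattern] = {
            "fraction": count / len(paths),
            "latency_pct": {
                c: (100.0 * t / total if total else 0.0)
                for c, t in totals[pattern].items()
            },
        }
    return report


# Hypothetical traced paths: two three-tier requests and one two-tier request.
paths = [
    [("httpd", 1.0), ("jboss", 2.0), ("mysql", 5.0), ("jboss", 1.0), ("httpd", 1.0)],
    [("httpd", 1.0), ("jboss", 2.0), ("mysql", 5.0), ("jboss", 1.0), ("httpd", 1.0)],
    [("httpd", 0.5), ("jboss", 3.0), ("httpd", 0.5)],
]

if __name__ == "__main__":
    for pattern, stats in latency_percentages(paths).items():
        print(" -> ".join(pattern), stats)
```

In this toy input, the three-tier pattern covers two of the three paths, and most of its latency falls in the hypothetical mysql tier, which is the kind of macrolevel signal the metric is meant to surface.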
