Pivot Tracing

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today -- logs, counters, and metrics -- have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
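As an illustration of the happened-before join described above, here is a minimal single-process sketch. It is not the prototype's implementation: the `Request` class, the `baggage` list, the two tracepoint functions, and the process names `FSread` and `MRsort` are all hypothetical names chosen for this example. It assumes that tuples recorded at an earlier event travel with the request, so a later event on the same execution path can be joined with them and grouped by their fields.

```python
from collections import defaultdict


class Request:
    """Hypothetical request context that travels with one execution."""

    def __init__(self):
        # Tuples recorded by earlier events on this request's path.
        self.baggage = []


def client_tracepoint(request, proc_name):
    # Earlier event: record which client process issued the request.
    request.baggage.append({"procName": proc_name})


def datanode_tracepoint(request, bytes_read, results):
    # Later event: join with every tuple that happened before it on the
    # same request, then aggregate bytes grouped by the earlier procName.
    for earlier in request.baggage:
        results[earlier["procName"]] += bytes_read


def run_example():
    results = defaultdict(int)
    # (client process name, bytes read at the storage node); made-up data.
    workload = [("FSread", 100), ("MRsort", 250), ("FSread", 50)]
    for proc_name, bytes_read in workload:
        req = Request()
        client_tracepoint(req, proc_name)
        datanode_tracepoint(req, bytes_read, results)
    return dict(results)


if __name__ == "__main__":
    print(run_example())
```

The metric (bytes read) is measured at one point, while the grouping key (the client process name) comes from an event in another component. Only events on the same request path are joined, which is what distinguishes this from a join on timestamps alone.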
