Known Unknowns in Large-Scale System Monitoring

This paper addresses a central challenge in PRISM, a large-scale distributed monitoring system: coping with the uncertainties and ambiguities introduced by network and node failures. In a large-scale monitoring system, such failures interact badly with the techniques needed for scalability, such as hierarchy, arithmetic filtering, and temporal batching. For example, if a monitoring subtree is silent over an interval, it is difficult to distinguish between two cases: (a) the subtree has sent no updates because its inputs have not changed significantly, or (b) the inputs have changed significantly but the subtree is unable to transmit its report. As a result, reported results can be arbitrarily far from their true values. To address this challenge, PRISM introduces Network Imprecision (NI), a new metric that characterizes accuracy despite node failures, network disruptions, and system reconfigurations. PRISM uses NI to flag potentially inaccurate results, allowing applications to differentiate between known-correct and likely-erroneous results and to correct distorted results by applying several redundancy techniques. Evaluation of our PRISM prototype shows that NI effectively flags inaccurate query results while incurring low overhead, and that using NI to automatically select the best results can reduce the inaccuracy of a PRISM-based monitoring service by nearly a factor of five.
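
The ambiguity of a silent subtree can be illustrated with a minimal Python sketch. It is not PRISM's implementation: the `FilteredSensor` class, the `parent_view` helper, the `delta` threshold, and all readings are hypothetical names and values chosen for illustration. The sketch assumes a simple arithmetic filter that suppresses an update whenever the new value is within `delta` of the last reported value.

```python
from dataclasses import dataclass


@dataclass
class FilteredSensor:
    """Hypothetical leaf node that reports only when its value drifts
    more than `delta` from the last value it reported."""
    delta: float
    last_reported: float = 0.0
    reachable: bool = True

    def observe(self, value: float):
        """Return the value sent upstream, or None if nothing is sent."""
        if abs(value - self.last_reported) <= self.delta:
            return None  # case (a): change is insignificant, filter suppresses it
        if not self.reachable:
            return None  # case (b): change is significant but cannot be transmitted
        self.last_reported = value
        return value


def parent_view(sensor, readings):
    """Return the value the parent believes after each reading,
    i.e. the last update it actually received."""
    believed = sensor.last_reported
    view = []
    for reading in readings:
        update = sensor.observe(reading)
        if update is not None:
            believed = update
        view.append(believed)
    return view


healthy = FilteredSensor(delta=5.0, last_reported=100.0)
cut_off = FilteredSensor(delta=5.0, last_reported=100.0, reachable=False)

quiet = [101.0, 103.0, 99.0]    # stays within delta of 100
surge = [150.0, 400.0, 900.0]   # far outside delta, but the node is cut off

view_a = parent_view(healthy, quiet)
view_b = parent_view(cut_off, surge)

# The parent sees identical silence in both cases...
same_view = view_a == view_b
# ...yet the true error is bounded by delta in one case and unbounded in the other.
error_a = max(abs(r - v) for r, v in zip(quiet, view_a))
error_b = max(abs(r - v) for r, v in zip(surge, view_b))
```

In this sketch the parent holds the same value in both runs, while the actual error is at most `delta` in the first and grows without bound in the second. That gap is what a metric such as NI is meant to expose to the application.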
