PRISM: PRecision-Integrated Scalable Monitoring

This paper describes PRISM, a scalable monitoring service that makes imprecision a first-class abstraction in its DHT-based aggregation service. Exposing imprecision is essential both for correctness in the face of network and node failures and for scalability to large systems. PRISM introduces the notion of conditioned consistency, which quantifies imprecision along a three-dimensional vector: arithmetic imprecision (AI) bounds numeric inaccuracy, temporal imprecision (TI) bounds update delays, and network imprecision (NI) bounds uncertainty due to network and node failures. AI and TI trade precision against monitoring overhead to achieve scalability, while NI addresses the fundamental challenge of providing consistency guarantees despite failures in a large distributed system. Our implementation addresses the challenge of providing these metrics while scaling to large numbers of nodes and attributes. By introducing a 10% AI, PRISM’s PlanetLab monitoring service, PrMon, can reduce network overheads by an order of magnitude compared to the currently used CoMon service. And, by using NI metrics to automatically select the best of four redundant aggregation results, we can reduce the observed worst-case inaccuracy by nearly a factor of five.
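To illustrate how an arithmetic imprecision bound can trade accuracy for fewer messages, the following is a minimal sketch of a generic update filter, not PRISM's actual mechanism, which the abstract does not specify. The class name `AIFilter`, the parameter `rel_error`, and the choice to measure the bound relative to the last reported value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AIFilter:
    """Hypothetical per-attribute filter: suppress an update while the new
    value stays within a relative error budget of the last reported value."""

    rel_error: float                      # assumed budget, e.g. 0.10 for 10%
    last_reported: Optional[float] = None
    sent: int = 0                         # number of updates actually sent

    def offer(self, value: float) -> bool:
        """Return True if `value` must be propagated to the parent node."""
        if self.last_reported is not None:
            slack = abs(self.last_reported) * self.rel_error
            if abs(value - self.last_reported) <= slack:
                return False              # still within the bound: stay silent
        self.last_reported = value        # bound violated (or first reading)
        self.sent += 1
        return True


# Usage: feed a stream of local readings through a 10% filter.
readings = [100, 104, 109, 111, 95, 89, 120]
f = AIFilter(rel_error=0.10)
decisions = [f.offer(r) for r in readings]
print(decisions, f.sent)
```

In this sketch a receiver holding the last reported value knows the true reading lies within the assumed budget of it, which is the kind of guarantee an AI bound is meant to express; a larger budget suppresses more updates at the cost of a looser bound.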
