A Scalable Self-Diagnosing Content Distribution Service With Bounded Latency

Providing contractual performance assurances in distributed systems is an important and challenging problem. From the users' perspective, stringent timing requirements are becoming more critical. Meanwhile, from the system engineers' perspective, distributed systems are being driven toward larger scale, tighter integration, and higher complexity, which makes predictable system performance difficult to achieve. In this dissertation, we present the design, implementation, and evaluation of a scalable self-diagnosing content distribution service that provides global bounded latencies on content access. Our solution first involves a decentralized replication scheme that dynamically selects subsets of the content distribution servers in wide-area networks for different classes of content so that per-class network latency bounds are achieved. The replication decisions are made autonomously by the servers based on dynamically measured network latencies and workload conditions. Content replication proceeds in a way that balances workload among servers, thereby fully utilizing system capacity and avoiding latency bound violations. The efficiency and decentralized nature of the replication scheme enable our solution to scale to very large content distribution networks.

The self-diagnosing capability of our service comes from the scalable learning-based performance problem diagnosis techniques we propose. The increasing complexity of systems has motivated the design of machine learning approaches to automate some system management tasks. However, as systems grow in scale, current approaches suffer from serious scalability issues. We present two scalable learning-based techniques that automatically identify probable causes of performance problems in large server systems with multiple tiers and replicated sites. By incorporating a large number of diagnostic information sources using a temporal segmentation mechanism and applying transfer learning techniques, we achieve both scalability and improved diagnosis accuracy.
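
As a rough illustration of what selecting a subset of servers to meet a latency bound can look like, the sketch below uses a centralized greedy set-cover heuristic with a load-based tie-break. It is not the dissertation's decentralized scheme, which the abstract does not specify in detail. The function name `select_replicas`, the data layout, and the sample numbers are all hypothetical.

```python
from typing import Dict, List, Set


def select_replicas(
    latency_ms: Dict[str, Dict[str, float]],  # latency_ms[server][region], assumed measured
    load: Dict[str, float],                   # assumed utilization per server, 0..1
    bound_ms: float,                          # latency bound for one content class
) -> List[str]:
    """Greedy set cover: add servers until every region has a replica within bound_ms."""
    regions: Set[str] = {r for per_server in latency_ms.values() for r in per_server}
    # Regions that each server can reach within the latency bound.
    covers = {
        s: {r for r, ms in per_server.items() if ms <= bound_ms}
        for s, per_server in latency_ms.items()
    }
    uncovered = set(regions)
    chosen: List[str] = []
    while uncovered:
        # Prefer the server covering the most uncovered regions; break ties by lower load.
        best = max(
            (s for s in covers if s not in chosen),
            key=lambda s: (len(covers[s] & uncovered), -load[s]),
            default=None,
        )
        if best is None or not covers[best] & uncovered:
            raise ValueError(f"no server can meet the bound for: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical measurements for three servers and three client regions.
    latency = {
        "s1": {"eu": 20, "us": 120, "asia": 200},
        "s2": {"eu": 110, "us": 30, "asia": 90},
        "s3": {"eu": 25, "us": 140, "asia": 60},
    }
    load = {"s1": 0.9, "s2": 0.2, "s3": 0.3}
    print(select_replicas(latency, load, bound_ms=100.0))
```

Calling the function once per content class, each with its own `bound_ms`, would yield a per-class replica set. A decentralized variant would have each server make this decision from its own local measurements.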

[1]  Michael Dahlin,et al.  Engineering server-driven consistency for large scale dynamic Web services , 2001, WWW '01.

[2]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[3]  Theodore P. Baker,et al.  Multiprocessor EDF and deadline monotonic schedulability analysis , 2003, RTSS 2003. 24th IEEE Real-Time Systems Symposium, 2003.

[4]  Ion Stoica,et al.  Implementing declarative overlays , 2005, SOSP '05.

[5]  Peter Triantafillou,et al.  Towards High Performance Peer-to-Peer Content and Resource Sharing Systems , 2003, CIDR.

[6]  Willy Zwaenepoel,et al.  Cluster reserves: a mechanism for resource management in cluster-based network servers , 2000, SIGMETRICS '00.

[7]  Pavlin Radoslavov,et al.  Topology-informed Internet replica placement , 2002, Comput. Commun..

[8]  Indranil Gupta,et al.  MON: On-Demand Overlays for Distributed System Management , 2005, WORLDS.

[9]  Tarek F. Abdelzaher,et al.  Towards content distribution networks with latency guarantees , 2004, Twelfth IEEE International Workshop on Quality of Service, 2004. IWQOS 2004..

[10]  Sanjoy Dasgupta,et al.  Experiments with Random Projection , 2000, UAI.

[11]  Larry L. Peterson,et al.  Reliability and Security in the CoDeeN Content Distribution Network , 2004, USENIX Annual Technical Conference, General Track.

[12]  Sanjoy K. Baruah Task Partitioning Upon Heterogeneous Multiprocessor Platforms , 2004, IEEE Real-Time and Embedded Technology and Applications Symposium.

[13]  Jussi Kangasharju,et al.  Object replication strategies in content distribution networks , 2002, Comput. Commun..

[14]  Tarek F. Abdelzaher,et al.  Web Content Adaptation to Improve Server Overload Behavior , 1999, Comput. Networks.

[15]  J. J. Garcia-Luna-Aceves,et al.  A new approach to channel access scheduling for Ad Hoc networks , 2001, MobiCom '01.

[16]  Sang Hyuk Son,et al.  Load balancing in bounded-latency content distribution , 2005, 26th IEEE International Real-Time Systems Symposium (RTSS'05).

[17]  Archana Ganapathi,et al.  Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.

[18]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[19]  Brad Cain,et al.  Known Content Network (CN) Request-Routing Mechanisms , 2003, RFC.

[20]  Armando Fox,et al.  Detecting application-level failures in component-based Internet services , 2005, IEEE Transactions on Neural Networks.

[21]  Dong-Ik Oh,et al.  Utilization Bounds for N-Processor Rate Monotone Scheduling with Static Processor Assignment , 1998, Real-Time Systems.

[22]  Li Fan,et al.  Web caching and Zipf-like distributions: evidence and implications , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[23]  Kang G. Shin,et al.  Period-Based Load Partitioning and Assignment for Large Real-Time Applications , 2000, IEEE Trans. Computers.

[24]  Vasek Chvátal,et al.  A Greedy Heuristic for the Set-Covering Problem , 1979, Math. Oper. Res..

[25]  Krishna P. Gummadi,et al.  An analysis of Internet content delivery systems , 2002, OPSR.

[26]  Helen J. Wang,et al.  Resilient peer-to-peer streaming , 2003, 11th IEEE International Conference on Network Protocols, 2003. Proceedings..

[27]  Dinesh C. Verma,et al.  Content Distribution Networks , 2002 .

[28]  Robbert van Renesse,et al.  Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining , 2003, TOCS.

[29]  Klara Nahrstedt,et al.  Optimal resource allocation in overlay multicast , 2003, 11th IEEE International Conference on Network Protocols, 2003. Proceedings..

[30]  I. Jolliffe Principal Component Analysis , 2002 .

[31]  Oscar H. Ibarra,et al.  Heuristic Algorithms for Scheduling Independent Tasks on Nonidentical Processors , 1977, JACM.

[32]  Giuseppe Lipari,et al.  Improved schedulability analysis of EDF on multiprocessor platforms , 2005, 17th Euromicro Conference on Real-Time Systems (ECRTS'05).

[33]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[34]  Paul Barford,et al.  Generating representative Web workloads for network and server performance evaluation , 1998, SIGMETRICS '98/PERFORMANCE '98.

[35]  Robin Fairbairns,et al.  The Design and Implementation of an Operating System to Support Distributed Multimedia Applications , 1996, IEEE J. Sel. Areas Commun..

[36]  Bo Li,et al.  On the optimal placement of web proxies in the Internet , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[37]  Douglas M. Freimuth,et al.  Kernel Mechanisms for Service Differentiation in Overloaded Web Servers , 2001, USENIX Annual Technical Conference, General Track.

[38]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[39]  Krithi Ramamritham,et al.  Distributed Scheduling of Tasks with Deadlines and Resource Requirements , 1989, IEEE Trans. Computers.

[40]  Kang G. Shin,et al.  Adaptive control of virtualized resources in utility computing environments , 2007, EuroSys '07.

[41]  Ludmila Cherkasova,et al.  Session Based Admission Control: A Mechanism for Improving the Performance of an Overloaded Web Server , 1998 .

[42]  George Candea,et al.  Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization , 2005, Second International Conference on Autonomic Computing (ICAC'05).

[43]  Atul Singh,et al.  Using queries for distributed monitoring and forensics , 2006, EuroSys.

[44]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.

[45]  David Mosberger,et al.  httperf—a tool for measuring web server performance , 1998, PERV.

[46]  Arindam Banerjee,et al.  Probabilistic Semi-Supervised Clustering with Constraints , 2006, Semi-Supervised Learning.

[47]  Zygmunt J. Haas,et al.  Virtual backbone generation and maintenance in ad hoc network mobility management , 2000, Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No.00CH37064).

[48]  Roger Wattenhofer,et al.  Constant Time Distributed Dominating Set Approximation , 2022 .

[49]  Richard M. Karp,et al.  Load balancing in dynamic structured P2P systems , 2004, IEEE INFOCOM 2004.

[50]  Indranil Gupta,et al.  MMC01-6: QoS-aware Object Replication in Overlay Networks , 2006, IEEE Globecom 2006.

[51]  Amin Vahdat,et al.  Bullet: high bandwidth data dissemination using an overlay mesh , 2003, SOSP '03.

[52]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[53]  Dan Rubenstein,et al.  Distributed, self-stabilizing placement of replicated resources in emerging networks , 2005, 11th IEEE International Conference on Network Protocols, 2003. Proceedings..

[54]  Arun Venkataramani,et al.  Bandwidth constrained placement in a WAN , 2001, PODC '01.

[55]  Philip S. Yu,et al.  Dynamic load balancing in geographically distributed heterogeneous Web servers , 1998, Proceedings. 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183).

[56]  Vanish Talwar,et al.  A resource allocation architecture with support for interactive sessions in utility Grids , 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004..

[57]  Lili Qiu,et al.  On the placement of Web server replicas , 2001, Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213).

[58]  Jeffrey S. Chase,et al.  Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.

[59]  Steve Muir The Seven Deadly Sins of Distributed Systems , 2004, WORLDS.

[60]  Ellen W. Zegura,et al.  A novel server selection technique for improving the response time of a replicated service , 1998, Proceedings. IEEE INFOCOM '98, the Conference on Computer Communications. Seventeenth Annual Joint Conference of the IEEE Computer and Communications Societies. Gateway to the 21st Century (Cat. No.98.

[61]  Richard P. Martin,et al.  Understanding and Dealing with Operator Mistakes in Internet Services , 2004, OSDI.

[62]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[63]  Michel Dagenais,et al.  Measuring and Characterizing System Behavior Using Kernel-Level Event Logging , 2000, USENIX Annual Technical Conference, General Track.

[64]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[65]  Tarek F. Abdelzaher,et al.  Bounded-latency content distribution feasibility and evaluation , 2005, IEEE Transactions on Computers.

[66]  Lui Sha,et al.  Queueing model based network server performance control , 2002, 23rd IEEE Real-Time Systems Symposium, 2002. RTSS 2002..

[67]  Balachander Krishnamurthy,et al.  On the use and performance of content distribution networks , 2001, IMW '01.

[68]  Peter Steenkiste,et al.  Evaluation and characterization of available bandwidth probing techniques , 2003, IEEE J. Sel. Areas Commun..

[69]  Larry L. Peterson,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation the Effectiveness of Request Redirection on Cdn Robustness , 2022 .

[70]  Ming Zhong,et al.  I/O system performance debugging using model-driven anomaly characterization , 2005, FAST'05.

[71]  Jayant R. Haritsa,et al.  MIRROR: a state-conscious concurrency control protocol for replicated real-time databases , 2002, Inf. Syst..

[72]  Vanish Talwar,et al.  Architecture for resource allocation services supporting interactive remote desktop sessions in utility grids , 2004, MGC '04.

[73]  Praveen Yalagandula,et al.  A scalable distributed information management system , 2004, SIGCOMM 2004.

[74]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[75]  Sanjoy K. Baruah,et al.  The partitioned multiprocessor scheduling of sporadic task systems , 2005, 26th IEEE International Real-Time Systems Symposium (RTSS'05).