Improving scalability and availability of cluster-based Internet services

Many highly scalable and available online services are backed by large-scale computer clusters in which service components are usually partitioned, replicated, and aggregated. This dissertation investigates techniques for improving the scalability and availability of cluster-based online services. In particular, it studies: (1) Dependency Isolation, a new mechanism that supports automatic recognition of dependency states and per-dependency management for thread-based multi-tier services. With Dependency Isolation, a partial failure or overload at one component does not cause cascading performance degradation across the entire system. (2) Data Aggregation Call, a novel programming primitive that exploits partition parallelism for large-scale cluster-based Internet services. At the cluster level, our load-adaptive reduction tree construction algorithm balances processing and aggregation load across servers while exploiting partition parallelism. Inside each node, we employ an event-driven thread pool design that prevents slow nodes from adversely affecting system throughput under highly concurrent workloads. We further devise a staged timeout scheme that eagerly prunes slow or unresponsive servers from the reduction tree to meet soft deadlines. (3) A scalable membership service that dynamically divides the entire cluster into membership groups based on the topology among nodes, so that the liveness of each node within a group is published to the others in a highly efficient manner. These schemes have been implemented and evaluated with different service applications to improve their scalability and availability.
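The staged timeout idea in (2) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the dissertation's algorithm: the `Node` type, the `aggregate` function, the fixed per-level `stage_margin_ms`, and the use of simulated latencies instead of real network calls are all hypothetical choices made for illustration. Each level of the reduction tree waits for its children only until a deadline that is one margin shorter than its own, so a slow server deep in the tree is dropped without delaying the levels above it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A server in a hypothetical reduction tree (all fields are illustrative)."""
    name: str
    value: int        # local partial result held by this server
    latency_ms: int   # simulated time to produce the local result
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node, deadline_ms: int,
              stage_margin_ms: int) -> Tuple[int, int, List[str]]:
    """Sum values over the tree, pruning children that miss their stage deadline.

    Returns (partial sum, simulated finish time in ms, names of pruned nodes).
    """
    # Children get a tighter deadline so this node can still answer in time.
    child_deadline = deadline_ms - stage_margin_ms
    total = node.value
    finish = node.latency_ms
    pruned: List[str] = []
    for child in node.children:
        sub_total, sub_finish, sub_pruned = aggregate(
            child, child_deadline, stage_margin_ms)
        pruned.extend(sub_pruned)
        if sub_finish > child_deadline:
            # Child is too slow: drop its subtree's contribution and stop
            # waiting once the stage timeout fires.
            pruned.append(child.name)
            finish = max(finish, child_deadline)
        else:
            total += sub_total
            finish = max(finish, sub_finish)
    return total, finish, pruned


# Example tree: server "a1" is far slower than the soft deadline allows.
slow_leaf = Node("a1", value=4, latency_ms=500)
tree = Node("root", value=1, latency_ms=10, children=[
    Node("a", value=2, latency_ms=20, children=[slow_leaf]),
    Node("b", value=8, latency_ms=30),
])
total, finish_ms, pruned = aggregate(tree, deadline_ms=300, stage_margin_ms=100)
print(total, finish_ms, pruned)
```

In this example the slow leaf is pruned by its parent, while the parent's own result still reaches the root well before the 300 ms soft deadline, so the root returns a partial aggregate instead of stalling.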