Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks

The scalability of n-tier systems relies on effective load balancing to distribute requests among the servers of the same tier. We found that the load balancing mechanisms (and some policies) of servers commonly used in n-tier systems (e.g., Apache and Tomcat) become unstable when very long response time (VLRT) requests appear because of millibottlenecks: very short bottlenecks that last only tens to hundreds of milliseconds. Experiments with standard n-tier benchmarks show that, during millibottlenecks, some policy/mechanism combinations send new requests to the node(s) suffering from the millibottleneck rather than to idle nodes, the opposite of what a load balancer is supposed to do. Several of these mistakes stem from implicit assumptions that load balancing policies and mechanisms make about the stability of system state. Our study shows that appropriate remedies at the policy and mechanism levels can avoid these mistakes during millibottlenecks and remove the VLRT requests, improving the average response time by a factor of 12.
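
As an illustration only, the following toy simulation sketches how a policy that assumes stable system state can keep routing requests to a stalled node. It is not the experimental setup or the remedy from this study: the two policy names (`cumulative`, `outstanding`), the function `simulate`, the tick-based time model, and all parameter values are hypothetical. The `cumulative` policy balances the total number of requests ever dispatched to each node, so it never notices a short stall; the `outstanding` policy picks the node with the fewest requests still in flight.

```python
def simulate(policy, ticks=100, stall=range(40, 60), nodes=2):
    """Toy model: one request arrives per tick; each healthy node
    completes at most one queued request per tick. Node 0 is stalled
    (completes nothing) during the ticks in `stall`.
    Returns how many requests were sent to node 0 while it was stalled."""
    queued = [0] * nodes      # requests currently in flight per node
    dispatched = [0] * nodes  # total requests ever sent per node
    sent_to_stalled = 0

    for t in range(ticks):
        stalled = t in stall

        if policy == "cumulative":
            # Balance lifetime request counts; blind to short stalls.
            target = min(range(nodes), key=lambda i: dispatched[i])
        elif policy == "outstanding":
            # Prefer the node with the fewest in-flight requests.
            target = min(range(nodes), key=lambda i: queued[i])
        else:
            raise ValueError(f"unknown policy: {policy}")

        queued[target] += 1
        dispatched[target] += 1
        if stalled and target == 0:
            sent_to_stalled += 1

        # Service step: every node except a stalled node 0 finishes one request.
        for i in range(nodes):
            if queued[i] > 0 and not (stalled and i == 0):
                queued[i] -= 1

    return sent_to_stalled


if __name__ == "__main__":
    for name in ("cumulative", "outstanding"):
        print(name, simulate(name))
```

In this toy model the cumulative-count policy keeps alternating between the two nodes throughout the stall, while the in-flight policy stops using the stalled node after the first request queues up on it. Real balancers differ in many ways (for example, in how and when they update their view of node state), so the sketch conveys only the general idea.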
