Performance scalability of a multi-core web server

Today's large multi-core Internet servers support thousands of concurrent connections, or flows. The computational capacity of future server platforms will depend on increasing numbers of cores. The key to ensuring that performance scales with core count is to design systems software and hardware to fully exploit the parallelism inherent in independent network flows. However, performance scaling on commercial web servers has proven elusive. This paper identifies the major bottlenecks to scalability for a reference server workload on a commercial server platform. We determined that on a web server running a modified SPECweb2005 Support workload, throughput scales only 4.8x on eight cores. Our results show that the operating system, TCP/IP stack, and application exploited flow-level parallelism well, with few exceptions, and that load imbalance and shared cache had little effect on performance. Having eliminated these potential bottlenecks, we determined that performance scaling was limited by the capacity of the address bus, which became saturated with all eight cores active. If this key obstacle is addressed, commercial web server and systems software are well-positioned to scale to a large number of cores.
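
To put the reported scaling figure in perspective, it can be restated as a parallel efficiency. This is a back-of-the-envelope sketch: the symbols S (measured speedup), N (core count), and E (efficiency) are introduced here for illustration and are not taken from the paper, and the baseline is assumed to be a single core.

```latex
% Parallel efficiency from the reported throughput scaling.
% Assumed notation (not from the paper): S = speedup, N = cores, E = efficiency.
\[
  E \;=\; \frac{S}{N} \;=\; \frac{4.8}{8} \;=\; 0.60
\]
```

Under that assumption, each core delivers on average about 60% of the throughput that ideal linear scaling would give.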
