HIGH PERFORMANCE AND SCALABLE SOFT SHARED STATE FOR NEXT-GENERATION DATACENTERS

In the past decade, with the increasing adoption of the Internet as the primary means of electronic interaction and communication, web-based datacenters have become a central requirement for providing online services. Today, several applications and services are deployed in such datacenters across a variety of domains, including e-commerce, medical informatics, and genomics. Most of these applications and services share significant state information that is critical for the efficient functioning of the datacenter. However, existing mechanisms for sharing this state information are inefficient in terms of performance and scalability, and are not resilient to loaded conditions in the datacenter. In addition, existing mechanisms do not take full advantage of the features of emerging technologies that are gaining momentum in current datacenters. This dissertation presents an efficient soft state sharing substrate that leverages the features of emerging technologies such as high-speed networks, Intel’s I/OAT, and multicore architectures to address these limitations. Specifically, the dissertation targets three important aspects: (i) designing efficient state sharing components using the features of emerging technologies, (ii) understanding the interactions between the proposed components, and (iii) analyzing the impact of the proposed components and their interactions on datacenter applications and services in terms of performance, scalability, and resiliency. Our evaluations of the soft state sharing substrate not only show an order of magnitude performance improvement over traditional implementations but also demonstrate resiliency to loaded conditions in the datacenter. Evaluations with several datacenter applications also suggest that the substrate is scalable and has low overhead.
The proposed substrate is portable across multiple modern interconnects, such as InfiniBand and iWARP-capable networks like 10-Gigabit Ethernet, in both LAN and WAN environments. In addition, our designs provide advanced capabilities such as one-sided communication and asynchronous memory copy operations, even on systems without high-speed networks or I/OAT. Thus, our proposed designs, optimizations, and evaluations demonstrate that the substrate is promising for tackling the state sharing issues in current and next-generation datacenters.
