The thread-based protocol engines for CC-NUMA multiprocessors

With the rapid growth of Internet services, large-scale, high-performance servers such as CC-NUMA multiprocessors are gaining importance in network computing. In a CC-NUMA multiprocessor, the key component connecting a computing node to the interconnection network is the node controller. Node controllers perform protocol processing to exchange messages with other nodes in the system. As new-generation CC-NUMA multiprocessors move toward application-specific protocol processing, a node controller will require very powerful protocol processors, or engines, to provide the flexibility to process different kinds of protocols. In this paper, we study the design of a thread-based node controller in which the protocol engines have a multithreaded architecture. Multithreading allows protocol processing of different requests to proceed in parallel, thereby reducing blocking and improving response time. Four important design parameters for a multithreaded protocol engine are examined: (1) the number of thread context storage units, (2) the number of protocol operation units, (3) the scheduling policy, and (4) the thread allocation scheme. From application-driven simulation of six representative applications, we conclude that the numbers of thread contexts and protocol operation units have a great impact on overall system performance. An appropriate thread allocation scheme for invalidation traffic is needed, and prioritizing threads and scheduling them accordingly is also important for system performance.
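To make the first three design parameters concrete, the following is a minimal sketch of a toy event-driven model, not the simulator used in the paper. Everything in it is an assumption made for illustration: the function name `simulate`, the request format `(arrival, service, priority)`, run-to-completion execution on an operation unit, and the two policy names `"fifo"` and `"priority"`. The thread allocation scheme is not modeled.

```python
import heapq
from collections import deque


def simulate(requests, num_contexts, num_units, policy="fifo"):
    """Toy model: return the mean response time of a protocol engine.

    requests: list of (arrival, service, priority) tuples; a lower
    priority value means more urgent. All names are hypothetical.
    """
    if num_contexts < 1 or num_units < 1:
        raise ValueError("need at least one context and one unit")
    if not requests:
        return 0.0
    # Requests not yet admitted, ordered by arrival time (stable sort).
    pending = deque(sorted(requests, key=lambda r: r[0]))
    ready = []      # heap of threads holding a context, awaiting a unit
    running = []    # heap of (finish_time, arrival) for busy units
    free_contexts = num_contexts
    responses = []
    now = 0
    seq = 0         # tie-breaker that preserves admission order
    while pending or ready or running:
        # Admit arrived requests while a thread context is free.
        while pending and pending[0][0] <= now and free_contexts > 0:
            arrival, service, prio = pending.popleft()
            key = prio if policy == "priority" else 0
            heapq.heappush(ready, (key, seq, arrival, service))
            seq += 1
            free_contexts -= 1
        # Dispatch ready threads to idle protocol operation units.
        while ready and len(running) < num_units:
            _, _, arrival, service = heapq.heappop(ready)
            heapq.heappush(running, (now + service, arrival))
        # Advance time to the next completion or admissible arrival.
        events = [running[0][0]] if running else []
        if pending and free_contexts > 0:
            events.append(pending[0][0])
        now = min(events)
        # Retire finished threads and release their contexts.
        while running and running[0][0] <= now:
            finish, arrival = heapq.heappop(running)
            responses.append(finish - arrival)
            free_contexts += 1
    return sum(responses) / len(responses)
```

In this toy model, a burst of four simultaneous requests of four cycles each, `[(0, 4, 1)] * 4`, has a mean response time of 10.0 with one context and one unit and 4.0 with four of each. Adding contexts without units, or units without contexts, leaves it at 10.0, which illustrates why both counts matter together.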
