A first glance at Kilo-instruction based multiprocessors

The ever increasing gap between processor and memory speed, sometimes referred to as the Memory Wall problem [42], has a very negative impact on performance. This mismatch will be more severe in future processor's generation. Modern cache organizations and prefetching techniques will not be able to solve this problem. A very novel and promising technique to deal with the Memory Wall consists on designing processors able to maintain thousands of in-flight instructions. An example of this kind of processors has been denoted as Kilo-instruction processors [8]. When running numerical applications, Kilo-instruction processors have demonstrated its ability to effectively maintain high values of IPC while increasing memory latencies.In this paper, we will study for the first time, the influence of Kilo-instruction processors on the performance of small-scale CC-NUMA multiprocessors. Our first results, using an ideal network, show the enormous potential of the Kilo-instruction processors, when using them as computing nodes, not only for hiding local DRAM latencies but also for the remote ones. A deeper analysis, using realistic networks, reveals the existence of heavy demands on packet throughput required by each node, since larger re-order buffers translate on higher density of remote accesses. Next, we show that current interconnection networks cannot cope with this high traffic levels, so newer and faster networks have to be designed. In short, our results show dramatic performance gains over multiprocessors based on current microprocessors and dictate a possible way to build future shared-memory multiprocessor systems.

[1]  Cruz Izu,et al.  On the Design of a High-Performance Adaptive Router for CC-NUMA Multiprocessors , 2003, IEEE Trans. Parallel Distributed Syst..

[2]  Jean-Loup Baer,et al.  Cost-effective compiler directed memory prefetching and bypassing , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[3]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[4]  T. N. Vijaykumar,et al.  Is SC + ILP = RC? , 1999, ISCA.

[5]  R. M. Tomasulo,et al.  An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[6]  Carmen Carrión,et al.  A flow control mechanism to avoid message deadlock in k-ary n-cube networks , 1997, Proceedings Fourth International Conference on High-Performance Computing.

[7]  M. Dubois,et al.  Assisted Execution , 1998 .

[8]  Josep Torrellas,et al.  Using a user-level memory thread for correlation prefetching , 2002, ISCA.

[9]  Mateo Valero,et al.  Delaying physical register allocation through virtual-physical registers , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[10]  John Paul Shen,et al.  Dynamic speculative precomputation , 2001, MICRO.

[11]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[12]  Cruz Izu,et al.  The Adaptive Bubble Router , 2001, J. Parallel Distributed Comput..

[13]  C.B. Stunkel,et al.  A New Switch Chip for IBM RS/6000 SP Systems , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[14]  Josep Llosa,et al.  Out-of-order commit processors , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[15]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[16]  T. N. Vijaykumar,et al.  Reducing Design Complexity of the Load/Store Queue , 2003, MICRO.

[17]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[18]  Yale N. Patt,et al.  Checkpoint repair for out-of-order execution machines , 1987, ISCA '87.

[19]  Per Stenström,et al.  A prefetching technique for irregular accesses to linked data structures , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[20]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[21]  Haitham Akkary,et al.  Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors , 2003, MICRO.

[22]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[23]  Sarita V. Adve,et al.  Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models , 1997, SPAA '97.

[24]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[25]  Shubhendu S. Mukherjee,et al.  The Alpha 21364 network architecture , 2001, HOT 9 Interconnects. Symposium on High Performance Interconnects.

[26]  Sarita V. Adve,et al.  RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors , 1997 .

[27]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[28]  Henry M. Levy,et al.  An Architecture for Software-Controlled Data Prefetching , 1991, ISCA.

[29]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[30]  Josep Llosa,et al.  Kilo-instruction Processors , 2003, ISHPC.

[31]  Valentin Puente,et al.  SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems , 2002, Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing.

[32]  Simha Sethumadhavan,et al.  Scalable hardware memory disambiguation for high-ILP processors , 2003, IEEE Micro.

[33]  Mike Galles Spider: a high-speed network interconnect , 1997, IEEE Micro.

[34]  Anoop Gupta,et al.  Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991, ASPLOS IV.

[35]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[36]  Gurindar S. Sohi,et al.  Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[37]  Josep Llosa,et al.  Large virtual robs by processor checkpointing , 2002 .

[38]  Josep Llosa,et al.  A case for resource-conscious out-of-order processors , 2004, IEEE Computer Architecture Letters.

[39]  Yale N. Patt,et al.  Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.

[40]  Anoop Gupta,et al.  Hiding memory latency using dynamic scheduling in shared-memory multiprocessors , 1992, ISCA '92.