Scalable, accurate multicore simulation in the 1000-core era

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router NoC architecture. The parallel simulation engine supports both cycle-accurate and periodic synchronization; both preserve functional accuracy, so users can trade perfect timing accuracy for higher speed while retaining very good accuracy. When run on 6 separate physical cores on a single die, speedups can exceed 5×, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design supports a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple dimension-order routing (DOR) to complex Valiant, ROMM, or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, directly emulate a MIPS-based multicore, or function as the memory subsystem for native applications executed under the Pin instrumentation tool. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/.
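
As an illustration of the simplest scheme named above, dimension-order routing on a 2D mesh can be sketched as follows. This is a minimal sketch in Python, not HORNET's table-based implementation; the function name `dor_route` and the (x, y) coordinate convention are assumptions made for the example.

```python
def dor_route(src, dst):
    """Return the hops from src to dst on a 2D mesh, X dimension first.

    Hypothetical helper for illustration only; nodes are (x, y) tuples.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Resolve the X offset completely before touching Y.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Then resolve the Y offset.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(dor_route((0, 0), (2, 1)))
```

Because every packet between a given source and destination follows the same fixed path, this scheme is oblivious and deterministic, which is why it serves as the simple baseline against randomized schemes such as Valiant or ROMM.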
