Paper information - CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors

CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors

With the advent of the multicore era [1], parallel application performance is no longer gated solely by the performance of an architecture's core arithmetic units: memory bandwidth has failed to grow at the same rate as effective core density. This paper presents a framework for constructing tightly coupled chip-multithreading (CMT) processors with features well suited to hiding main-memory latency and executing highly concurrent applications. The framework, termed the "Convey Hybrid OpenMP" or CHOMP architecture, is built around a RISC instruction set that lets the hardware and software runtime mechanisms participate in efficient scheduling of concurrent application workloads, regardless of the distribution and type of instructions used. As a result, every instruction in CHOMP can participate in the concurrency algorithms of the hardware scheduler that drive context switch events. Combined with a set of hardware-supported extended memory semantic instructions, this makes the CHOMP architecture well suited to applications that access memory with non-unit-stride or irregular patterns. Furthermore, the CHOMP architecture and framework contain specific logic and instruction set support that allow application-level, dynamic power gating of individual register files and function pipes.

John D. Leidel | Tony Brewer | Kevin Wadleigh | Joe Bolding | Dean Walker

[1] G. Gao, et al. FAST: A Functionally Accurate Simulation Toolset for the Cyclops 64 Cellular Architecture, 2005.

[2] A. Kumar, et al. Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip, 2008, IEEE Journal of Solid-State Circuits.

[3] Petr Konecny. Introducing the Cray XMT, 2007.

[4] Allan Porterfield, et al. The Tera computer system, 1990, ICS '90.

[5] Stephen L. Olivier, et al. Scheduling task parallelism on multi-socket multicore systems, 2011, ROSS '11.

[6] Ken Kennedy, et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution, 1993, LCPC.

[7] Mark D. Hill, et al. Amdahl's Law in the Multicore Era, 2008, Computer.

[8] David A. Patterson, et al. Computer Architecture - A Quantitative Approach, 5th Edition, 1996.

[9] John Hawkes, et al. Linux® Scalability for Large NUMA Systems, 2003.

[10] Guang R. Gao, et al. Landing openMP on cyclops-64: an efficient mapping of openMP to a many-core system-on-a-chip, 2006, CF '06.