CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors

Given the recent advent of the multicore era [1], we find that parallel application performance is no longer solely gated by an architecture's core arithmetic unit performance. Memory bandwidth has failed to grow at the same rate as effective core density. This paper presents a framework for constructing tightly coupled, chip-multithreading [CMT] processors that contain specific features well-suited to hiding latency to main memory and executing highly concurrent applications. This framework, deemed the “Convey Hybrid OpenMP” or CHOMP architecture, is built around a RISC instruction set that permits the hardware and software runtime mechanisms to participate in efficient scheduling of concurrent application workloads regardless of the distribution and type of instructions utilized. In this manner, all instructions in CHOMP have the ability to participate in the concurrency algorithms present in the hardware scheduler that drive context switch events. This, coupled with a set of hardware supported extended memory semantic instructions, means that the CHOMP architecture is well suited to executing applications that access memory using non-unit stride or irregular access patterns. Furthermore, the CHOMP architecture and framework contains specific logic and instruction set support that allows application-level, dynamic power gating of individual register files and function pipes.