Exploiting new design tradeoffs in chip multiprocessor caches
The microprocessor industry has converged on the chip multiprocessor (CMP) as the architecture of choice for utilizing the numerous on-chip transistors. Multiple CMP cores substantially increase the pressure on the limited on-chip cache capacity while still requiring fast data access. The lowest-level on-chip CMP cache must therefore not only use its capacity effectively but also mitigate the increased latencies caused by slow wire-delay scaling. Conventional shared and private caches can provide either capacity or fast access, but not both.
To mitigate wire delays in large lower-level caches, this thesis proposes a novel technique called distance associativity, which exploits the non-uniform access latencies of widely spaced cache subarrays. Distance associativity enables flexible placement of a core’s frequently accessed data in the closest subarrays for fast access.
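The placement idea can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions, not the thesis's design: the class name `DistanceAssociativeCache`, the per-group latencies, the group capacity, and the promote-on-every-hit and demote-on-overflow policies are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DistanceAssociativeCache:
    """Toy model: subarrays are grouped by distance from the core.

    All numbers and policies here are illustrative assumptions.
    """
    group_latency: tuple = (4, 8, 16)  # cycles per distance group, closest first (hypothetical)
    group_capacity: int = 2            # blocks per distance group (hypothetical)
    groups: list = field(default_factory=list)

    def __post_init__(self):
        # One most-recently-used-first list of blocks per distance group.
        self.groups = [[] for _ in self.group_latency]

    def access(self, block):
        """Return the hit latency in cycles, or None on a miss.

        Assumed policy: an accessed block is always moved to the closest group.
        """
        for g, blocks in enumerate(self.groups):
            if block in blocks:
                latency = self.group_latency[g]
                blocks.remove(block)
                self._place(block, 0)
                return latency
        self._place(block, 0)  # miss: fill into the closest group
        return None

    def _place(self, block, g):
        """Insert block into group g, demoting the overflow victim outward."""
        if g >= len(self.groups):
            return  # fell off the farthest group: evicted
        self.groups[g].insert(0, block)
        if len(self.groups[g]) > self.group_capacity:
            victim = self.groups[g].pop()  # least recently used in this group
            self._place(victim, g + 1)
```

In this sketch, a block that is accessed repeatedly stays in the closest (lowest-latency) group, while blocks that are not reused drift toward farther groups before being evicted.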
To provide both capacity and fast access in CMP caches, this thesis makes the key observation that CMPs fundamentally reverse the latency-capacity tradeoff that exists in conventional symmetric multiprocessors (SMPs) and distributed shared memory multiprocessors (DSMs). CMPs have limited on-chip cache capacity but fast on-chip communication, whereas SMPs and DSMs have virtually unlimited cache capacity but slow off-chip communication. To exploit this tradeoff reversal, this thesis proposes three novel mechanisms: (i) controlled replication, (ii) in-situ communication, and (iii) capacity stealing.
This work also observes that commercial multithreaded programs exhibit substantial variations in capacity demands and communication behaviors. Optimizations that rely on static replication thresholds, such as controlled replication and in-situ communication, cannot adapt to these workload variations. To address this limitation, this thesis proposes the use of dynamic replication thresholds in controlled replication and in-situ communication.
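The difference between a static and a dynamic replication threshold can be sketched as follows. This is a hedged illustration only: the abstract does not specify how thresholds are defined or adjusted, so the functions `should_replicate` and `adapt_threshold`, the counters `capacity_misses` and `remote_hits`, and the bounds `lo` and `hi` are all hypothetical.

```python
def should_replicate(remote_reads, threshold):
    """Assumed rule: make a local copy of a block once a core has read
    the remote copy at least `threshold` times."""
    return remote_reads >= threshold


def adapt_threshold(threshold, capacity_misses, remote_hits, lo=1, hi=8):
    """Hypothetical feedback rule for one sampling interval.

    More capacity misses than remote hits suggests capacity pressure,
    so replicate less (raise the threshold). More remote hits suggests
    latency matters more, so replicate sooner (lower the threshold).
    """
    if capacity_misses > remote_hits:
        return min(hi, threshold + 1)
    if remote_hits > capacity_misses:
        return max(lo, threshold - 1)
    return threshold


# A static scheme would keep `threshold` fixed; a dynamic one updates it
# each interval from observed behavior (the counts below are made up).
threshold = 2
for capacity_misses, remote_hits in [(10, 3), (12, 4), (2, 9)]:
    threshold = adapt_threshold(threshold, capacity_misses, remote_hits)
```

A fixed threshold corresponds to never calling `adapt_threshold`; the dynamic variant trades a small amount of bookkeeping for the ability to follow changes in a workload's capacity demand and sharing behavior.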
Experimental results show that for a 4-core CMP with an 8 MB cache, the proposed CMP-NuRAPID cache outperforms conventional shared caches by 20% and 33% on multithreaded and multiprogrammed workloads, respectively.