Organizing the last line of defense before hitting the memory wall for CMPs

The last line of defense in the cache hierarchy before going to off-chip memory is very critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) towards reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands across the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations, while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, Specjbb, on a complete system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as 42.50% improvement in IPC over the private organization (with 11.52% on the average), and as much as 42.22% improvement over the shared interleaved organization (with 9.76% on the average).

[1]  G. Edward Suh,et al.  Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.

[2]  R. Stets,et al.  A detailed comparison of two transaction processing workloads , 2002, 2002 IEEE International Workshop on Workload Characterization.

[3]  Anoop Gupta,et al.  The impact of architectural trends on operating system performance , 1995, SOSP.

[4]  Rudolf Eigenmann,et al.  SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[5]  Luiz André Barroso,et al.  Impact of chip-level integration on performance of OLTP workloads , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[6]  Mahmut T. Kandemir,et al.  An integer linear programming based approach for parallelizing applications in On-chip multiprocessors , 2002, DAC '02.

[7]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[8]  W. Nebel System-level power optimization , 2004 .

[9]  Francky Catthoor,et al.  Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design , 1998 .

[10]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[11]  G. Edward Suh,et al.  Dynamic Cache Partitioning for Simultaneous Multithreading Systems , 2004 .

[12]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[13]  Norman P. Jouppi,et al.  Reconfigurable caches and their application to media processing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[14]  Luca Benini,et al.  System-level power optimization: techniques and tools , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[15]  Hugh Garraway Parallel Computer Architecture: A Hardware/Software Approach , 1999, IEEE Concurrency.

[16]  Kathryn M. Jackson,et al.  IBM S/390 storage hierarchy - G5 and G6 performance considerations , 1999, IBM J. Res. Dev..

[17]  Michel Dubois,et al.  Evaluation of Shared Cache Architectures for TPC-H , 2022 .

[18]  David H. Albonesi,et al.  Selective cache ways: on-demand cache resource allocation , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[19]  Anoop Gupta,et al.  Parallel computer architecture - a hardware / software approach , 1998 .

[20]  Kaushik Roy,et al.  An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[21]  Norman P. Jouppi,et al.  CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .

[22]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[23]  Josep Torrellas,et al.  A Chip-Multiprocessor Architecture with Speculative Multithreading , 1999, IEEE Trans. Computers.