Paper information - Efficient throughput cores for asymmetric manycore processors

Efficient throughput cores for asymmetric manycore processors

Power, thermal and complexity limits have forced the microprocessor industry to switch from developing ever more complex and more deeply pipelined single-core processors to multicore processors. Future microprocessors will be asymmetric manycore chip multiprocessors, with a small number of complex cores for serial programs and the serial sections of parallel programs. The majority of the cores will be small, power- and area-efficient cores that maximize overall throughput within a limited power budget. The main contributions of this dissertation are techniques for improving the performance and area-efficiency of these throughput-oriented cores. This work shows how the single-thread performance of small, scalar cores can be increased, and how such cores can be dynamically combined, to speed up programs with only a limited number of parallel threads. It also shows how to improve both the cores and the cache subsystem of multicore processors using SIMD cores.

Kevin Skadron | David Tarjan
