Efficient throughput cores for asymmetric manycore processors

The microprocessor industry has had to switch from developing ever more complex and more deeply pipelined single-core processors to multicore processors due to running into power, thermal and complexity limits. Future microprocessors will be asymmetric manycore chip multiprocessors, with a small number of complex cores for serial programs and serial sections of parallel programs. The majority of the cores will be small, power- and area-efficient cores to maximize overall throughput in a limited power budget. The main contributions of this dissertation are techniques for improving the performance and area-efficiency of these throughput-oriented cores. This work shows how the single-thread performance of small, scalar cores can be increased or dynamically combined to speed up programs with only a limited number of parallel threads. It also shows how to improve both the cores and the cache subsystem of multicore processor using SIMD cores.

[1]  Kunle Olukotun,et al.  Maximizing CMP throughput with mediocre cores , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[2]  Allan Porterfield,et al.  The Tera computer system , 1990 .

[3]  Balaram Sinharoy,et al.  IBM Power5 chip: a dual-core multithreaded processor , 2004, IEEE Micro.

[4]  Per Stenström,et al.  A Survey of Cache Coherence Schemes for Multiprocessors , 1990, Computer.

[5]  Kevin Skadron,et al.  Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision , 2009 .

[6]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[7]  Kevin Skadron,et al.  Federation: Repurposing scalar cores for out-of-order instruction issue , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[8]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[9]  Marcelo Yuffe,et al.  The Implementation of the 65nm Dual-Core 64b Merom Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[10]  Berkin Özisikyilmaz,et al.  MineBench: A Benchmark Suite for Data Mining Workloads , 2006, 2006 IEEE International Symposium on Workload Characterization.

[11]  Takayasu Sakurai,et al.  A simple MOSFET model for circuit analysis , 1991 .

[12]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[13]  Amir Roth,et al.  Store vulnerability window (SVW): re-execution filtering for enhanced load optimization , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[14]  Stéphan Jourdan,et al.  An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors , 2004, International Journal of Parallel Programming.

[15]  Lizy Kurian John,et al.  Scaling to the end of silicon with EDGE architectures , 2004, Computer.

[16]  Philippe Roussel,et al.  The microarchitecture of the intel pentium 4 processor on 90nm technology , 2004 .

[17]  Yao Zhang,et al.  Parallel Computing Experiences with CUDA , 2008, IEEE Micro.

[18]  R.H. Dennard,et al.  Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.

[19]  Amitabh Varshney,et al.  High-throughput sequence alignment using Graphics Processing Units , 2007, BMC Bioinformatics.

[20]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[21]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[22]  Onur Mutlu,et al.  Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.

[23]  D.H. Albonesi,et al.  An oldest-first selection logic implementation for non-compacting issue queues [microprocessor power reduction] , 2002, 15th Annual IEEE International ASIC/SOC Conference.

[24]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[25]  Shreekant S. Thakkar,et al.  Internet Streaming SIMD Extensions , 1999, Computer.

[26]  H. Peter Hofstee,et al.  Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.

[27]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[28]  M. K. Gowan,et al.  A 65 nm 2-Billion Transistor Quad-Core Itanium Processor , 2009, IEEE Journal of Solid-State Circuits.

[29]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[30]  Amir Roth Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization , 2005, ISCA 2005.

[31]  E. Fluhr,et al.  Design and Implementation of the POWER6 Microprocessor , 2008, IEEE Journal of Solid-State Circuits.

[32]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[33]  John Paul Shen,et al.  Best of both latency and throughput , 2004, IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings..

[34]  Kai Li,et al.  Virtual-Memory-Mapped Network Interfaces , 1995, IEEE Micro.

[35]  Gurindar S. Sohi,et al.  Characterizing and predicting value degree of use , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[36]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[37]  Fred Weber,et al.  AMD 3DNow! technology: architecture and implementations , 1999, IEEE Micro.

[38]  William J. Dally,et al.  Programmable Stream Processors , 2003, Computer.

[39]  J. Torrellas,et al.  Energy-efficient hybrid wakeup logic , 2002, Proceedings of the International Symposium on Low Power Electronics and Design.

[40]  Gabriel H. Loh,et al.  Matrix scheduler reloaded , 2007, ISCA '07.

[41]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[42]  Aaron Smith,et al.  Compiling for EDGE architectures , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[43]  Pradeep Dubey,et al.  Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.

[44]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[45]  S. Tomita,et al.  A high-speed dynamic instruction scheduling scheme for supersealar processors , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[46]  Gabriel H. Loh,et al.  Fire-and-Forget: Load/Store Scheduling with No Store Queue at All , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[47]  M. Tremblay,et al.  UltraSparc I: a four-issue processor supporting multimedia , 1996, IEEE Micro.

[48]  H. Cheong,et al.  A cache coherence scheme with fast selective invalidation , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[49]  Brian Fahs,et al.  Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[50]  Uri C. Weiser,et al.  MMX technology extension to the Intel architecture , 1996, IEEE Micro.

[51]  Chris Wilkerson,et al.  Hierarchical scheduling windows , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[52]  Jose Renau,et al.  Power model validation through thermal measurements , 2007, ISCA '07.

[53]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[54]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[55]  Sang H. Dhong,et al.  Microarchitecture and implementation of the synergistic processor in 65-nm and 90-nm SOI , 2007, IBM J. Res. Dev..

[56]  Marc Tremblay,et al.  The visual instruction set (VIS) in UltraSPARC , 1995, Digest of Papers. COMPCON'95. Technologies for the Information Superhighway.

[57]  Gabriel H. Loh,et al.  The Cost of Uncore in Throughput-Oriented Many-Core Processors , 2008 .

[58]  R. Kumar,et al.  An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[59]  Norman P. Jouppi,et al.  Conjoined-Core Chip Multiprocessing , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[60]  André Seznec,et al.  CASH: Revisiting Hardware Sharing in Single-Chip Parallel Processors , 2004, J. Instr. Level Parallelism.

[61]  Roni Rosner,et al.  Specialized dynamic optimizations for high-performance energy-efficient microarchitecture , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[62]  Timothy Johnson,et al.  An 8-core, 64-thread, 64-bit power efficient sparc soc (niagara2) , 2007, ISPD '07.

[63]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[64]  Milo M. K. Martin,et al.  Scalable store-load forwarding via store queue index prediction , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[65]  S. Winkel Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[66]  Kai Li,et al.  Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.

[67]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008 .

[68]  David Blythe The Direct3D 10 system , 2006, SIGGRAPH 2006.

[69]  Norman P. Jouppi,et al.  The multicluster architecture: reducing cycle time through partitioning , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[70]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[71]  Itsujiro Arita,et al.  Revisiting Direct Tag Search Algorithm on Superscalar Processors , 2001 .

[72]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[73]  José González,et al.  Distributed Cooperative Caching , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[74]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[75]  Gürhan Küçük,et al.  Distributed reorder buffer schemes for low power , 2003, Proceedings 21st International Conference on Computer Design.

[76]  Ahmed Sameh,et al.  The Illiac IV system , 1972 .

[77]  Norman P. Jouppi Cache write policies and performance , 1993, ISCA '93.

[78]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[79]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[80]  B. Fagin Partial Resolution in Branch Target Buffers , 1997, IEEE Trans. Computers.

[81]  Haitham Akkary,et al.  Scalable load and store processing in latency tolerant processors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[82]  Milo M. K. Martin,et al.  NoSQ: Store-Load Communication without a Store Queue , 2007, IEEE Micro.

[83]  Justin P. Haldar,et al.  Accelerating advanced mri reconstructions on gpus , 2008, CF '08.

[84]  Scott A. Mahlke,et al.  Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[85]  K. Steinhubl Design of Ion-Implanted MOSFET'S with Very Small Physical Dimensions , 1974 .

[86]  Moinuddin K. Qureshi Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[87]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[88]  Craig B. Zilles,et al.  Fundamental performance constraints in horizontal fusion of in-order cores , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[89]  Joshua A. Anderson,et al.  General purpose molecular dynamics simulations fully implemented on graphics processing units , 2008, J. Comput. Phys..

[90]  Dezsö Sima,et al.  The Design Space of Register Renaming Techniques , 2000, IEEE Micro.

[91]  Kevin Skadron,et al.  A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..

[92]  Martin Burtscher,et al.  Efficient emulation of hardware prefetchers via event-driven helper threading , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[93]  Kevin Skadron,et al.  Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[94]  P. Stenstrom A survey of cache coherence schemes for multiprocessors , 1990, Computer.

[95]  Jung Ho Ahn,et al.  Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[96]  Justin P. Haldar,et al.  Accelerating advanced MRI reconstructions on GPUs , 2008, J. Parallel Distributed Comput..

[97]  Krste Asanovic,et al.  RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture , 2006 .

[98]  Michael C. Huang,et al.  Substituting Associative Load Queue with Simple Hash Tables in Out-of-Order Microprocessors , 2006, ISLPED'06 Proceedings of the 2006 International Symposium on Low Power Electronics and Design.

[99]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[100]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[101]  G.E. Moore,et al.  Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.

[102]  Dirk Grunwald,et al.  Next cache line and set prediction , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[103]  Simha Sethumadhavan,et al.  Scalable Hardware Memory Disambiguation for High-ILP Processors , 2004, IEEE Micro.

[104]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[105]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor architecture , 1998, Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273).

[106]  L. W. Tucker,et al.  Architecture and applications of the Connection Machine , 1988, Computer.

[107]  Masahiro Goshima,et al.  A high-speed dynamic instruction scheduling scheme for superscalar processors , 2001, MICRO.

[108]  Gurindar S. Sohi 25 Years of the International Symposia on Computer Architecture (Selected Papers) , 1998, ISCA Selected Papers.

[109]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[110]  Rohit Bhatia,et al.  Montecito: a dual-core, dual-thread Itanium processor , 2005, IEEE Micro.

[111]  Jaehyuk Huh,et al.  A NUCA substrate for flexible CMP cache sharing , 2005, ICS.

[112]  Kevin Skadron,et al.  Federation: Out-of-Order Execution using Simple In-Order Cores , 2007 .

[113]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.