Efficient throughput cores for asymmetric manycore processors
[1] Kunle Olukotun,et al. Maximizing CMP throughput with mediocre cores , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[2] Allan Porterfield,et al. The Tera computer system , 1990 .
[3] Balaram Sinharoy,et al. IBM Power5 chip: a dual-core multithreaded processor , 2004, IEEE Micro.
[4] Per Stenström,et al. A Survey of Cache Coherence Schemes for Multiprocessors , 1990, Computer.
[5] Kevin Skadron,et al. Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision , 2009 .
[6] Engin Ipek,et al. Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.
[7] Kevin Skadron,et al. Federation: Repurposing scalar cores for out-of-order instruction issue , 2008, 2008 45th ACM/IEEE Design Automation Conference.
[8] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.
[9] Marcelo Yuffe,et al. The Implementation of the 65nm Dual-Core 64b Merom Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.
[10] Berkin Özisikyilmaz,et al. MineBench: A Benchmark Suite for Data Mining Workloads , 2006, 2006 IEEE International Symposium on Workload Characterization.
[11] Takayasu Sakurai,et al. A simple MOSFET model for circuit analysis , 1991 .
[12] Aaftab Munshi,et al. The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).
[13] Amir Roth,et al. Store vulnerability window (SVW): re-execution filtering for enhanced load optimization , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[14] Stéphan Jourdan,et al. An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors , 2004, International Journal of Parallel Programming.
[15] Lizy Kurian John,et al. Scaling to the end of silicon with EDGE architectures , 2004, Computer.
[16] Philippe Roussel,et al. The microarchitecture of the Intel Pentium 4 processor on 90nm technology , 2004 .
[17] Yao Zhang,et al. Parallel Computing Experiences with CUDA , 2008, IEEE Micro.
[18] R.H. Dennard,et al. Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.
[19] Amitabh Varshney,et al. High-throughput sequence alignment using Graphics Processing Units , 2007, BMC Bioinformatics.
[20] Willy Zwaenepoel,et al. Implementation and performance of Munin , 1991, SOSP '91.
[21] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[22] Onur Mutlu,et al. Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.
[23] D.H. Albonesi,et al. An oldest-first selection logic implementation for non-compacting issue queues [microprocessor power reduction] , 2002, 15th Annual IEEE International ASIC/SOC Conference.
[24] Doug Burger,et al. Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .
[25] Shreekant S. Thakkar,et al. Internet Streaming SIMD Extensions , 1999, Computer.
[26] H. Peter Hofstee,et al. Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.
[27] Richard M. Russell,et al. The CRAY-1 computer system , 1978, CACM.
[28] M. K. Gowan,et al. A 65 nm 2-Billion Transistor Quad-Core Itanium Processor , 2009, IEEE Journal of Solid-State Circuits.
[29] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[30] Amir Roth. Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization , 2005, ISCA 2005.
[31] E. Fluhr,et al. Design and Implementation of the POWER6 Microprocessor , 2008, IEEE Journal of Solid-State Circuits.
[32] Vikas Agarwal,et al. Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[33] John Paul Shen,et al. Best of both latency and throughput , 2004, IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings..
[34] Kai Li,et al. Virtual-Memory-Mapped Network Interfaces , 1995, IEEE Micro.
[35] Gurindar S. Sohi,et al. Characterizing and predicting value degree of use , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..
[36] Norman P. Jouppi,et al. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[37] Fred Weber,et al. AMD 3DNow! technology: architecture and implementations , 1999, IEEE Micro.
[38] William J. Dally,et al. Programmable Stream Processors , 2003, Computer.
[39] J. Torrellas,et al. Energy-efficient hybrid wakeup logic , 2002, Proceedings of the International Symposium on Low Power Electronics and Design.
[40] Gabriel H. Loh,et al. Matrix scheduler reloaded , 2007, ISCA '07.
[41] Doug Burger,et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.
[42] Aaron Smith,et al. Compiling for EDGE architectures , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[43] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.
[44] Jichuan Chang,et al. Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[45] S. Tomita,et al. A high-speed dynamic instruction scheduling scheme for superscalar processors , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.
[46] Gabriel H. Loh,et al. Fire-and-Forget: Load/Store Scheduling with No Store Queue at All , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[47] M. Tremblay,et al. UltraSparc I: a four-issue processor supporting multimedia , 1996, IEEE Micro.
[48] H. Cheong,et al. A cache coherence scheme with fast selective invalidation , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.
[49] Brian Fahs,et al. Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[50] Uri C. Weiser,et al. MMX technology extension to the Intel architecture , 1996, IEEE Micro.
[51] Chris Wilkerson,et al. Hierarchical scheduling windows , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..
[52] Jose Renau,et al. Power model validation through thermal measurements , 2007, ISCA '07.
[53] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[54] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[55] Sang H. Dhong,et al. Microarchitecture and implementation of the synergistic processor in 65-nm and 90-nm SOI , 2007, IBM J. Res. Dev..
[56] Marc Tremblay,et al. The visual instruction set (VIS) in UltraSPARC , 1995, Digest of Papers. COMPCON'95. Technologies for the Information Superhighway.
[57] Gabriel H. Loh,et al. The Cost of Uncore in Throughput-Oriented Many-Core Processors , 2008 .
[58] R. Kumar,et al. An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.
[59] Norman P. Jouppi,et al. Conjoined-Core Chip Multiprocessing , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[60] André Seznec,et al. CASH: Revisiting Hardware Sharing in Single-Chip Parallel Processors , 2004, J. Instr. Level Parallelism.
[61] Roni Rosner,et al. Specialized dynamic optimizations for high-performance energy-efficient microarchitecture , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[62] Timothy Johnson,et al. An 8-core, 64-thread, 64-bit power efficient SPARC SoC (Niagara2) , 2007, ISPD '07.
[63] Kunle Olukotun,et al. Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.
[64] Milo M. K. Martin,et al. Scalable store-load forwarding via store queue index prediction , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).
[65] S. Winkel. Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[66] Kai Li,et al. Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.
[67] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008 .
[68] David Blythe. The Direct3D 10 system , 2006, SIGGRAPH 2006.
[69] Norman P. Jouppi,et al. The multicluster architecture: reducing cycle time through partitioning , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[70] Margaret Martonosi,et al. Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[71] Itsujiro Arita,et al. Revisiting Direct Tag Search Algorithm on Superscalar Processors , 2001 .
[72] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.
[73] José González,et al. Distributed Cooperative Caching , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[74] Brad Calder,et al. Automatically characterizing large scale program behavior , 2002, ASPLOS X.
[75] Gürhan Küçük,et al. Distributed reorder buffer schemes for low power , 2003, Proceedings 21st International Conference on Computer Design.
[76] Ahmed Sameh,et al. The Illiac IV system , 1972 .
[77] Norman P. Jouppi. Cache write policies and performance , 1993, ISCA '93.
[78] Jens H. Krüger,et al. A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.
[79] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[80] B. Fagin. Partial Resolution in Branch Target Buffers , 1997, IEEE Trans. Computers.
[81] Haitham Akkary,et al. Scalable load and store processing in latency tolerant processors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[82] Milo M. K. Martin,et al. NoSQ: Store-Load Communication without a Store Queue , 2007, IEEE Micro.
[83] Justin P. Haldar,et al. Accelerating advanced MRI reconstructions on GPUs , 2008, CF '08.
[84] Scott A. Mahlke,et al. Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[85] K. Steinhubl. Design of Ion-Implanted MOSFET'S with Very Small Physical Dimensions , 1974 .
[86] Moinuddin K. Qureshi. Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[87] Luiz André Barroso,et al. Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[88] Craig B. Zilles,et al. Fundamental performance constraints in horizontal fusion of in-order cores , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[89] Joshua A. Anderson,et al. General purpose molecular dynamics simulations fully implemented on graphics processing units , 2008, J. Comput. Phys..
[90] Dezsö Sima,et al. The Design Space of Register Renaming Techniques , 2000, IEEE Micro.
[91] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..
[92] Martin Burtscher,et al. Efficient emulation of hardware prefetchers via event-driven helper threading , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[93] Kevin Skadron,et al. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[94] P. Stenstrom. A survey of cache coherence schemes for multiprocessors , 1990, Computer.
[95] Jung Ho Ahn,et al. Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[96] Justin P. Haldar,et al. Accelerating advanced MRI reconstructions on GPUs , 2008, J. Parallel Distributed Comput..
[97] Krste Asanovic,et al. RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture , 2006 .
[98] Michael C. Huang,et al. Substituting Associative Load Queue with Simple Hash Tables in Out-of-Order Microprocessors , 2006, ISLPED'06 Proceedings of the 2006 International Symposium on Low Power Electronics and Design.
[99] David J. Sager,et al. The microarchitecture of the Pentium 4 processor , 2001 .
[100] Vivek Sarkar,et al. Baring It All to Software: Raw Machines , 1997, Computer.
[101] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[102] Dirk Grunwald,et al. Next cache line and set prediction , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[103] Simha Sethumadhavan,et al. Scalable Hardware Memory Disambiguation for High-ILP Processors , 2004, IEEE Micro.
[104] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..
[105] Richard E. Kessler,et al. The Alpha 21264 microprocessor architecture , 1998, Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273).
[106] L. W. Tucker,et al. Architecture and applications of the Connection Machine , 1988, Computer.
[107] Masahiro Goshima,et al. A high-speed dynamic instruction scheduling scheme for superscalar processors , 2001, MICRO.
[108] Gurindar S. Sohi. 25 Years of the International Symposia on Computer Architecture (Selected Papers) , 1998, ISCA Selected Papers.
[109] H. Peter Hofstee,et al. Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..
[110] Rohit Bhatia,et al. Montecito: a dual-core, dual-thread Itanium processor , 2005, IEEE Micro.
[111] Jaehyuk Huh,et al. A NUCA substrate for flexible CMP cache sharing , 2005, ICS.
[112] Kevin Skadron,et al. Federation: Out-of-Order Execution using Simple In-Order Cores , 2007 .
[113] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.