SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
Stefanos Kaxiras | Magnus Själander | Trevor E. Carlson | Alexandra Jimborean | Konstantinos Koukos | Kim-Anh Tran
[1] Sebastian Winkel,et al. Latency-tolerant software pipelining in a production compiler , 2008, CGO '08.
[2] Chi-Keung Luk,et al. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.
[3] Scott B. Baden,et al. Redefining the Role of the CPU in the Era of CPU-GPU Integration , 2012, IEEE Micro.
[4] Margaret Martonosi,et al. DeSC: Decoupled supply-compute communication management for heterogeneous architectures , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[5] David I. August,et al. Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.
[6] Victor V. Zyuban,et al. Inherently Lower-Power High-Performance Superscalar Architectures , 2001, IEEE Trans. Computers.
[7] Carole-Jean Wu,et al. SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[8] Srinivas Devadas,et al. IMP: Indirect memory prefetcher , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Andrew A. Chien,et al. The future of microprocessors , 2011, Commun. ACM.
[10] Gurindar S. Sohi,et al. Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.
[11] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
[12] Weifeng Zhang,et al. Accelerating and Adapting Precomputation Threads for Efficient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[13] Jan Reineke,et al. Ascertaining Uncertainty for Efficient Exact Cache Analysis , 2017, CAV.
[14] Henk Corporaal,et al. High-level software-pipelining in LLVM , 2015, SCOPES.
[15] Alexander Aiken,et al. Resource-Constrained Software Pipelining , 1995, IEEE Trans. Parallel Distributed Syst..
[16] Sam Ainsworth,et al. Software prefetching for indirect memory accesses , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[17] Gurindar S. Sohi,et al. Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[18] David I. August,et al. Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.
[19] Carlo H. Séquin,et al. Design and Implementation of RISC I , 1982 .
[20] Yale N. Patt,et al. Achieving Out-of-Order Performance with Almost In-Order Complexity , 2008, 2008 International Symposium on Computer Architecture.
[21] Dean M. Tullsen,et al. Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.
[22] Stéphan Jourdan,et al. Speculation techniques for improving load related instruction scheduling , 1999, ISCA.
[23] Eric Rotenberg,et al. Slipstream processors: improving both performance and fault tolerance , 2000, ASPLOS IX.
[24] Pen-Chung Yew,et al. A Scheme to Enforce Data Dependence on Large Multiprocessor Systems , 1987, IEEE Trans. Software Eng..
[25] Margaret Martonosi,et al. Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors , 1996, ISCA.
[26] Marc Tremblay,et al. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC® Processor , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.
[27] Scott A. Mahlke,et al. Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 25.
[28] Trevor Mudge,et al. Improving data cache performance by pre-executing instructions under a cache miss , 1997 .
[29] M. Hill,et al. Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[30] Manoj Franklin,et al. The multiscalar architecture , 1993 .
[31] Robert H. Dennard,et al. A 30 Year Retrospective on Dennard's MOSFET Scaling Paper , 2007 .
[32] Richard W. Vuduc,et al. When Prefetching Works, When It Doesn’t, and Why , 2012, TACO.
[33] Dean M. Tullsen,et al. Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices , 2005, PLDI '05.
[34] Thomas M. Conte,et al. High-performance and low-cost dual-thread VLIW processor using Weld architecture paradigm , 2005, IEEE Transactions on Parallel and Distributed Systems.
[35] Stefanos Kaxiras,et al. Non-speculative load-load reordering in TSO , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[36] Haitham Akkary,et al. Continual flow pipelines , 2004, ASPLOS XI.
[37] Stijn Eyerman,et al. An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim..
[38] David Black-Schaffer,et al. Navigating the cache hierarchy with a single lookup , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[39] B. Ramakrishna Rau,et al. Data Flow and Dependence Analysis for Instruction Level Parallelism , 1991, LCPC.
[40] Eric Rotenberg,et al. Control-Flow Decoupling: An Approach for Timely, Non-Speculative Branching , 2015, IEEE Transactions on Computers.
[41] David W. Binkley,et al. Program slicing , 2008, 2008 Frontiers of Software Maintenance.
[42] Onur Mutlu,et al. A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[43] Lieven Eeckhout,et al. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[44] Haitham Akkary,et al. A simple latency tolerant processor , 2008, 2008 IEEE International Conference on Computer Design.
[45] Erik Hagersten,et al. Resource conscious prefetching for irregular applications in multicores , 2014, 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV).
[46] B. R. Rau,et al. HPL-PD Architecture Specification: Version 1.1 , 2000 .
[47] Stefanos Kaxiras,et al. Multiversioned decoupled access-execute: the key to energy-efficient compilation of general-purpose programs , 2016, CC.
[48] M. Dubois,et al. Assisted Execution , 1998 .
[49] Weng-Fai Wong,et al. Static identification of delinquent loads , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[50] Lieven Eeckhout,et al. The Load Slice Core microarchitecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[51] Erik Hagersten,et al. A Case for Resource Efficient Prefetching in Multicores , 2014, ICPP.
[52] Jung Ho Ahn,et al. The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing , 2013, TACO.
[53] John Paul Shen,et al. Dynamic speculative precomputation , 2001, MICRO.
[54] Aleksandar Milenkovic,et al. Experiment flows and microbenchmarks for reverse engineering of branch predictor structures , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[55] Guilherme Ottoni,et al. Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).
[56] Koji Nii,et al. A 28 nm High-k/MG Heterogeneous Multi-Core Mobile Application Processor With 2 GHz Cores and Low-Power 1 GHz Cores , 2015, IEEE Journal of Solid-State Circuits.
[57] Krishna V. Palem,et al. Adaptive Compiler Directed Prefetching for EPIC Processors , 2004, PDPTA.
[58] David J. Lilja,et al. Data prefetch mechanisms , 2000, CSUR.
[59] Craig Zilles,et al. Execution-based prediction using speculative slices , 2001, ISCA 2001.
[60] Sanjay J. Patel,et al. OUTRIDER: Efficient memory latency tolerance with decoupled strands , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[61] David Black-Schaffer,et al. Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling , 2014, CGO '14.
[62] Santosh Nagarakatte,et al. iCFP: Tolerating All-Level Cache Misses in In-Order Processors , 2010, IEEE Micro.
[63] John Paul Shen,et al. Speculative precomputation: long-range prefetching of delinquent loads , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.
[64] David A. Patterson,et al. Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .
[65] Marc Tremblay,et al. Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor , 2009, ISCA '09.
[66] Gurindar S. Sohi,et al. Task selection for a multiscalar processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[67] John L. Henning. SPEC CPU2006 benchmark descriptions , 2006, CARN.
[68] Amir Roth,et al. BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[69] Juan Touriño,et al. An Inspector-Executor Algorithm for Irregular Assignment Parallelization , 2004, ISPA.
[70] David Black-Schaffer,et al. AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[71] Antonio González,et al. Energy-effective issue logic , 2001, ISCA 2001.
[72] Stefanos Kaxiras,et al. Clairvoyance: Look-ahead compile-time scheduling , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[73] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[74] Vijayalakshmi Srinivasan,et al. Exploring the limits of prefetching , 2005, IBM J. Res. Dev..