Efficient Data Supply for Parallel Heterogeneous Architectures
暂无分享,去创建一个
[1] Alasdair Rawsthorne,et al. The effectiveness of decoupling , 1993, ICS '93.
[2] Donald Yeung,et al. Design and evaluation of compiler algorithms for pre-execution , 2002, ASPLOS X.
[3] Margaret Martonosi,et al. DeSC: Decoupled supply-compute communication management for heterogeneous architectures , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[4] Craig Zilles,et al. Execution-based prediction using speculative slices , 2001, ISCA 2001.
[5] Jack L. Lo,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[6] Sanjay J. Patel,et al. OUTRIDER: Efficient memory latency tolerance with decoupled strands , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[7] Tao Chen,et al. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[8] Douglas J. Joseph,et al. Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[9] Huiyang Zhou,et al. Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[10] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..
[11] Weifeng Zhang,et al. Accelerating and Adapting Precomputation Threads for Effcient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[12] Onur Mutlu,et al. Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[13] James E. Smith,et al. Data Cache Prefetching Using a Global History Buffer , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[14] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[15] David A. Wood,et al. WiDGET: Wisconsin decoupled grid execution tiles , 2010, ISCA.
[16] Karthikeyan Sankaralingam,et al. Stream-dataflow acceleration , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[17] Jung Ho Ahn,et al. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[18] Eric Rotenberg,et al. A large, fast instruction window for tolerating cache misses , 2002, ISCA.
[19] Karthikeyan Sankaralingam,et al. Efficient execution of memory access phases using dataflow specialization , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[20] Margaret Martonosi,et al. Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures , 2017, ACM Trans. Archit. Code Optim..
[21] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[22] David I. August,et al. Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..
[23] Christoforos E. Kozyrakis,et al. Learning Memory Access Patterns , 2018, ICML.
[24] Stijn Eyerman,et al. Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance , 2014, IEEE Computer Architecture Letters.
[25] Haitham Akkary,et al. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors , 2003, MICRO.
[26] David Black-Schaffer,et al. Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling , 2014, CGO '14.
[27] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[28] Francisco J. Cazorla,et al. A Flexible Heterogeneous Multi-Core Architecture , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[29] Onur Mutlu,et al. Continuous runahead: Transparent hardware acceleration for memory intensive workloads , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[30] Alasdair Rawsthorne,et al. Compiling and Optimizing for Decoupled Architectures , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[31] Marc Tremblay,et al. Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor , 2009, ISCA '09.
[32] Amir Roth,et al. BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[33] Mateo Valero,et al. A decoupled KILO-instruction processor , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..
[34] A. Djabelkhir,et al. Characterization of embedded applications for decoupled processor architecture , 2003, 2003 IEEE International Conference on Communications (Cat. No.03CH37441).
[35] Thomas F. Wenisch,et al. Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[36] K. Sundaramoorthy,et al. Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.
[37] Yale N. Patt,et al. Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.
[38] Richard E. Kessler,et al. Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.
[39] Jan Reineke,et al. Ascertaining Uncertainty for Efficient Exact Cache Analysis , 2017, CAV.
[40] Calvin Lin,et al. Linearizing irregular memory accesses for improved correlated prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[41] Mateo Valero,et al. Toward kilo-instruction processors , 2004, TACO.
[42] Michael C. Huang,et al. DRUT: An Efficient Turbo Boost Solution via Load Balancing in Decoupled Look-Ahead Architecture , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[43] Michael C. Huang,et al. A performance-correctness explicitly-decoupled architecture , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[44] Haitham Akkary,et al. Continual flow pipelines , 2004, ASPLOS XI.
[45] Stijn Eyerman,et al. An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim..
[46] Dean M. Tullsen,et al. Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.
[47] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[48] Yale N. Patt,et al. MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[49] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[50] Wei-Chung Hsu,et al. Dynamic helper threaded prefetching on the Sun UltraSPARC/spl reg/ CMP processor , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).
[51] David W. Binkley,et al. Program slicing , 2008, 2008 Frontiers of Software Maintenance.
[52] Lieven Eeckhout,et al. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).