Efficient data supply for hardware accelerators with prefetching and access/execute decoupling
Tao Chen | G. Edward Suh
[1] Kermin Fleming,et al. Leap scratchpads: automatic memory and cache management for reconfigurable logic , 2011, FPGA '11.
[2] Anoop Gupta,et al. Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.
[3] Steven Swanson,et al. Conservation cores: reducing the energy of mature computations , 2010, ASPLOS XV.
[4] Thomas F. Wenisch,et al. Thin servers with smart pipes: designing SoC accelerators for memcached , 2013, ISCA.
[5] John Wawrzynek,et al. Architectural synthesis of computational pipelines with decoupled memory access , 2014, 2014 International Conference on Field-Programmable Technology (FPT).
[6] Jeff Mason,et al. CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures , 2008, 2008 International Conference on Field Programmable Logic and Applications.
[7] Kenneth A. Ross,et al. Navigating big data with high-throughput, energy-efficient data partitioning , 2013, ISCA.
[8] Christopher Batten,et al. PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[9] Jean-Loup Baer,et al. An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[10] Karthikeyan Sankaralingam,et al. Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[11] Eric S. Chung,et al. LINQits: big data on little clients , 2013, ISCA.
[12] Tao Chen,et al. Execution time prediction for energy-efficient hardware accelerators , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[13] Kermin Fleming,et al. Optimizing under abstraction: Using prefetching to improve FPGA performance , 2013, 2013 23rd International Conference on Field programmable Logic and Applications.
[14] Zhiru Zhang,et al. Multithreaded pipeline synthesis for data-parallel kernels , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[15] John Paul Shen,et al. Speculative precomputation: long-range prefetching of delinquent loads , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.
[16] Nigel P. Topham,et al. A comparison of data prefetching on an access decoupled and superscalar machine , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[17] Gu-Yeon Wei,et al. MachSuite: Benchmarks for accelerator design and customized architectures , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[18] Karthikeyan Sankaralingam,et al. Efficient execution of memory access phases using dataflow specialization , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[19] Thomas F. Wenisch,et al. Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[20] James C. Hoe,et al. CoRAM: an in-fabric memory architecture for FPGA-based computing , 2011, FPGA '11.
[21] David W. Binkley,et al. Program slicing , 2008, 2008 Frontiers of Software Maintenance.
[22] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[23] Guilherme Ottoni,et al. Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).
[24] Margaret Martonosi,et al. DeSC: Decoupled supply-compute communication management for heterogeneous architectures , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
[26] Gu-Yeon Wei,et al. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[27] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[28] James E. Smith,et al. Data Cache Prefetching Using a Global History Buffer , 2005, IEEE Micro.
[29] Zhiru Zhang,et al. Flushing-enabled loop pipelining for high-level synthesis , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).
[30] Ralph Wittig,et al. Performance and power of cache-based reconfigurable computing , 2009, FPGA '09.
[31] Feng Liu,et al. CGPA: Coarse-Grained Pipelined Accelerators , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).