Supporting Address Translation for Accelerator-Centric Architectures
Jason Cong | Yuchen Hao | Glenn Reinman | Zhenman Fang
[1] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[2] Per Hammarlund,et al. 4th generation Intel™ Core processor, codenamed Haswell , 2013, 2013 IEEE Hot Chips 25 Symposium (HCS).
[3] Jason Cong,et al. Architecture support for accelerator-rich CMPs , 2012, DAC Design Automation Conference 2012.
[4] Steven Swanson,et al. Conservation cores: reducing the energy of mature computations , 2010, ASPLOS XV.
[5] Chen-Yong Cher,et al. A wire-speed Power™ processor: 2.3GHz 45nm SOI with 16 cores and 64 threads , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).
[6] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS III.
[7] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[8] David A. Wood,et al. A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[9] Babak Falsafi,et al. Meet the walkers: accelerating index traversals for in-memory databases , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[10] Ming Yang,et al. Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[11] Scott A. Mahlke,et al. Polymorphic Pipeline Array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[12] Michael M. Swift,et al. Efficient virtual memory for big memory servers , 2013, ISCA.
[13] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009.
[14] G. Kandiraju,et al. Going the distance for TLB prefetching: an application-driven study , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.
[15] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[16] Srilatha Manne,et al. Accelerating two-dimensional page walks for virtualized systems , 2008, ASPLOS.
[17] Luis Ceze,et al. Neural Acceleration for General-Purpose Approximate Programs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[18] Stephen Neuendorffer,et al. Building Zynq® accelerators with Vivado® high level synthesis , 2013, FPGA '13.
[19] Jason Cong,et al. Architecture Support for Domain-Specific Accelerator-Rich CMPs , 2014, ACM Trans. Embed. Comput. Syst..
[20] Jason Cong,et al. High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[21] Jason Cong,et al. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms , 2016, 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC).
[22] Osman S. Unsal,et al. Energy-efficient address translation , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[23] Christoforos E. Kozyrakis,et al. Convolution engine: balancing efficiency & flexibility in specialized computing , 2013, ISCA.
[24] Mark Silberstein,et al. ActivePointers: A Case for Software Address Translation on GPUs , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[25] Jason Cong,et al. PARADE: A cycle-accurate full-system simulation Platform for Accelerator-Rich Architectural Design and Exploration , 2015, 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[26] Jacob Nelson,et al. SNNAP: Approximate computing on programmable SoCs via neural acceleration , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[27] Eric S. Chung,et al. LINQits: big data on little clients , 2013, ISCA.
[28] Trevor N. Mudge,et al. A look at several memory management units, TLB-refill mechanisms, and page table organizations , 1998, ASPLOS VIII.
[29] Thomas A. Ziaja,et al. Sparc T4: A Dynamically Threaded Server-on-a-Chip , 2012, IEEE Micro.
[30] Gu-Yeon Wei,et al. Co-designing accelerators and SoC interfaces using gem5-Aladdin , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[31] Christoforos E. Kozyrakis,et al. Understanding sources of inefficiency in general-purpose chips , 2010, ISCA.
[32] Hari Angepat,et al. An FPGA-based In-Line Accelerator for Memcached , 2014, IEEE Computer Architecture Letters.
[33] David A. Wood,et al. Border control: Sandboxing accelerators , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[34] Per Stenström,et al. Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[35] Abhishek Bhattacharjee,et al. Large-reach memory management unit caches , 2013, MICRO.
[36] Margaret Martonosi,et al. Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[37] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[38] Christoforos E. Kozyrakis,et al. Convolution engine , 2015, Commun. ACM.
[39] Alan L. Cox,et al. Translation caching: skip, don't walk (the page table) , 2010, ISCA.
[40] Mark D. Hill,et al. Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.
[41] Karthikeyan Sankaralingam,et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.
[42] Trevor N. Mudge,et al. Software-managed address translation , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.
[43] Muli Ben-Yehuda,et al. IOMMU: strategies for mitigating the IOTLB bottleneck , 2010, ISCA'10.
[44] Karthikeyan Sankaralingam,et al. Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.
[45] Muli Ben-Yehuda,et al. rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers , 2015, ASPLOS.
[46] Collin McCurdy,et al. Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors , 2008, ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software.
[47] Thomas F. Wenisch,et al. Thin servers with smart pipes: designing SoC accelerators for memcached , 2013, ISCA.
[48] Anil Krishna,et al. Hardware acceleration in the IBM PowerEN processor: architecture and performance , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[49] Gu-Yeon Wei,et al. Toward Cache-Friendly Hardware Accelerators , 2015.