Memory-centric system interconnect design with hybrid memory cubes
Mehrzad Samadi | Scott Mahlke | Ankit Sethia | Ganesh S. Dasika
[1] Yuan Xie,et al. Energy-efficient GPU design with reconfigurable in-package graphics memory , 2012, ISLPED '12.
[2] William J. Dally,et al. Flattened butterfly: a cost-efficient topology for high-radix networks , 2007, ISCA '07.
[3] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.
[4] Richard W. Vuduc,et al. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[5] William J. Dally,et al. Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[6] Robert J. Safranek,et al. Intel® QuickPath Interconnect Architectural Features Supporting Scalable System Architectures , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.
[7] William J. Dally,et al. Cost-Efficient Dragonfly Topology for Large-Scale Systems , 2009, IEEE Micro.
[8] John R. Feehrer,et al. The Oracle Sparc T5 16-Core Processor Scales to Eight Sockets , 2013, IEEE Micro.
[9] Goichi Ono,et al. A 12.3mW 12.5Gb/s complete transceiver in 65nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).
[10] Nan Jiang,et al. A detailed and flexible cycle-accurate Network-on-Chip simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[11] Hyesoon Kim,et al. An integrated GPU power and performance model , 2010, ISCA.
[12] Vivek De,et al. Life is CMOS: why chase the life after? , 2002, DAC '02.
[13] Marcelo Cintra,et al. Stream chaining: exploiting multiple levels of correlation in data prefetching , 2009, ISCA '09.
[14] Jose-Maria Arnau,et al. Boosting mobile GPU performance with a decoupled access/execute fragment processor , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[15] Jung Ho Ahn,et al. CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[16] G. Edward Suh,et al. SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[17] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[18] Bruce Jacob,et al. Buffer-on-board memory systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[19] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[20] Bruce Jacob,et al. The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It , 2009 .
[21] Janak H. Patel,et al. Stride directed prefetching in scalar processors , 1992, MICRO 1992.
[22] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.
[23] O Seongil,et al. McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[24] Tao Tang,et al. Power Optimization for GPU Programs Based on Software Prefetching , 2011, 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications.
[25] Sriram R. Vangal,et al. A 2 Tb/s 6 × 4 Mesh Network for a Single-Chip Cloud Computer With DVFS in 45 nm CMOS , 2011, IEEE Journal of Solid-State Circuits.
[26] Joseph Antony,et al. Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport , 2006, HiPC.
[27] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.
[28] William J. Dally,et al. Principles and Practices of Interconnection Networks , 2004 .
[29] Jung Ho Ahn,et al. The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing , 2013, TACO.
[30] Homan Igehy,et al. Prefetching in a texture cache architecture , 1998, Workshop on Graphics Hardware.
[31] M. Horowitz,et al. A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS , 2007, IEEE Journal of Solid-State Circuits.
[32] Kevin Skadron,et al. Increasing memory miss tolerance for SIMD cores , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[33] William J. Dally,et al. The BlackWidow High-Radix Clos Network , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[34] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[35] Onur Mutlu,et al. Coordinated control of multiple prefetchers in multi-core systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[36] Aamer Jaleel,et al. Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[37] Scott A. Mahlke,et al. PEPSC: A Power-Efficient Processor for Scientific Computing , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[38] Jason Cong,et al. Utilizing Radio-Frequency Interconnect for a Many-DIMM DRAM System , 2012, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
[39] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[40] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[41] Norman P. Jouppi,et al. Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[42] J. Thomas Pawlowski,et al. Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).
[43] A. Shubat,et al. Mappable peripheral memory for high speed applications , 1989, Proceedings. VLSI and Computer Peripherals. COMPEURO 89.
[44] Norman P. Jouppi,et al. CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .
[45] Onur Mutlu,et al. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[46] Mark Horowitz,et al. Energy-Efficient Floating-Point Unit Design , 2011, IEEE Transactions on Computers.
[47] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[48] Jean-Loup Baer,et al. Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.
[49] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[50] Brian Rogers,et al. Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.
[51] Mike Higgins,et al. Cray Cascade: A scalable HPC system based on a Dragonfly network , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[52] Larry Kaplan,et al. The Gemini System Interconnect , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.