Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory