Location-aware cache management for many-core processors with deep cache hierarchy
[1] Nikil D. Dutt,et al. Efficient utilization of scratch-pad memory in embedded processor applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.
[2] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI.
[3] Mary K. Vernon,et al. Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS III.
[4] William J. Dally,et al. Principles and Practices of Interconnection Networks , 2004 .
[5] Evgenia Smirni,et al. The KSR1: experimentation and modeling of poststore , 1993, SIGMETRICS '93.
[6] Aamer Jaleel,et al. High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.
[7] Anoop Gupta,et al. The Stanford Dash multiprocessor , 1992, Computer.
[8] Yue Wang,et al. Reverse-time migration , 1999 .
[9] Wayne Berke,et al. A Cache Technique for Synchronization Variables in Highly Parallel, Shared Memory Systems , 1988 .
[10] Jung Ho Ahn,et al. Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[11] William N. Scherer,et al. Scalable queue-based spin locks with timeout , 2001, PPoPP '01.
[12] José González,et al. Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors , 2010, ISCA.
[13] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[14] Michel Dubois,et al. Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[15] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.
[16] Jean-Loup Baer,et al. Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).
[17] Pradeep Dubey,et al. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow.
[18] Christoforos E. Kozyrakis,et al. Comparing memory systems for chip multiprocessors , 2007, ISCA '07.
[19] Francesco Zappa Nardelli,et al. x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors , 2010 .
[20] David A. Patterson,et al. Virtual Local Stores: Enabling Software-Managed Memory Hierarchies in Mainstream Computing Environments , 2009 .
[21] Pradeep Dubey,et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort , 2010, SIGMOD Conference.
[22] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.
[24] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time , 1992, International Journal of Parallel Programming.
[25] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[26] Josep Torrellas,et al. Data forwarding in scalable shared-memory multiprocessors , 1995, ICS '95.
[27] Henry Hoffmann,et al. Remote Store Programming , 2010, HiPEAC.
[28] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[29] Mark Alan Gebhart,et al. Energy-efficient mechanisms for managing on-chip storage in throughput processors , 2012 .
[30] David Black-Schaffer,et al. Efficient techniques for predicting cache sharing and throughput , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[31] Martin Hopkins,et al. Synergistic Processing in Cell's Multicore Architecture , 2006, IEEE Micro.
[32] Donald Yeung,et al. Studying multicore processor scaling via reuse distance analysis , 2013, ISCA.
[33] Mary K. Vernon,et al. Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS 1989.
[34] Mitchell Hayenga,et al. MadCache: A PC-aware Cache Insertion Policy , 2010 .
[35] Nian-Feng Tzeng,et al. Distributing Hot-Spot Addressing in Large-Scale Multiprocessors , 1987, IEEE Transactions on Computers.
[36] Sharad Malik,et al. Orion: a power-performance simulator for interconnection networks , 2002, MICRO.
[37] Michael Wolfe,et al. High performance compilers for parallel computing , 1995 .
[38] David Eklov,et al. StatStack: Efficient modeling of LRU caches , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[39] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[40] Arnold L. Rosenberg,et al. Using the compiler to improve cache replacement decisions , 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.
[41] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.
[42] David A. Wood,et al. ASR: Adaptive Selective Replication for CMP Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[43] Paul Feautrier. Some efficient solutions to the affine scheduling problem , 1992 .
[44] Srinivas Devadas,et al. Dynamic Cache Partitioning via Columnization , 2000, DAC 2000.
[45] Jung Ho Ahn,et al. A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies , 2008, 2008 International Symposium on Computer Architecture.
[46] George Kurian,et al. Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[47] Björn Lisper,et al. Data cache locking for higher program predictability , 2003, SIGMETRICS '03.
[48] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[49] Yale N. Patt,et al. Adaptive insertion policies for high performance caching , 2007 .