Balance Principles for Algorithm-Architecture Co-Design
Richard W. Vuduc | Aparna Chandramowlishwaran | Kenneth Czechowski | Chris McClanahan | Casey Battaglino
[1] David Patterson,et al. LogP: towards a realistic model of parallel computation , 1993 .
[2] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[3] Guy E. Blelloch,et al. Scheduling threads for constructive cache sharing on CMPs , 2007, SPAA '07.
[4] Gul A. Agha,et al. Analysis of Parallel Algorithms for Energy Conservation in Scalable Multicore Architectures , 2009, 2009 International Conference on Parallel Processing.
[5] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[6] Edward D. Lazowska,et al. Quantitative System Performance , 1985, Int. CMG Conference.
[7] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[8] David Patterson,et al. The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges , 2009 .
[9] Ken Kennedy,et al. Estimating Interlock and Improving Balance for Pipelined Architectures , 1988, J. Parallel Distributed Comput..
[10] James Demmel,et al. Minimizing Communication in Linear Algebra , 2009, ArXiv.
[11] James Demmel,et al. Modeling the benefits of mixed data and task parallelism , 1995, SPAA '95.
[12] Alok Aggarwal,et al. The input/output complexity of sorting and related problems , 1988, CACM.
[13] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[14] Guy E. Blelloch,et al. Programming parallel algorithms , 1996, CACM.
[15] Richard W. Vuduc,et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.
[16] William Gropp,et al. An introductory exascale feasibility study for FFTs and multigrid , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[17] H. T. Kung. Memory requirements for balanced computer architectures , 1986, ISCA '86.
[18] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[19] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[20] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008 .
[21] Keshav Pingali,et al. An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.
[22] Gabriel H. Loh,et al. 3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.
[23] Yao Zhang,et al. A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[24] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.
[25] Guy E. Blelloch,et al. Low depth cache-oblivious algorithms , 2010, SPAA '10.
[26] Leslie G. Valiant. A Bridging Model for Multi-core Computing , 2008, ESA.
[27] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.
[28] Edward D. Lazowska,et al. Quantitative system performance - computer system analysis using queueing network models , 1983, Int. CMG Conference.
[29] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009 .
[30] Constantine Bekas,et al. A new energy aware performance metric , 2010, Computer Science - Research and Development.
[31] James Reinders,et al. Intel® threading building blocks , 2008 .
[32] Matteo Frigo,et al. Reducers and other Cilk++ hyperobjects , 2009, SPAA '09.
[33] M. Greenstreet,et al. An Energy Aware Model of Computation , 2008 .
[34] Guy E. Blelloch,et al. The data locality of work stealing , 2000, SPAA.
[35] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[36] Robert W. Numrich,et al. A metric space for computer programs and the principle of computational least action , 2008, The Journal of Supercomputing.
[37] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[38] Alain J. Martin. Towards an energy complexity of computation , 2001, Inf. Process. Lett..
[39] Richard P. Brent,et al. The Parallel Evaluation of General Arithmetic Expressions , 1974, JACM.
[40] Ravi Jain,et al. Towards a model of energy complexity for algorithms [mobile wireless applications] , 2005, IEEE Wireless Communications and Networking Conference, 2005.
[41] Hsien-Hsin S. Lee,et al. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era , 2008, Computer.
[42] Charles E. Leiserson,et al. Cache-Oblivious Algorithms , 2003, CIAC.
[43] David A. Patterson,et al. Latency lags bandwith , 2004, CACM.