Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology
Pradeep Dubey | Milind Girkar | Rakesh Krishnaiyer | Mikhail Smelyanskiy | Hideki Saito | Changkyu Kim | Jatin Chhugani | Nadathur Rajagopalan Satish
[1] Ayal Zaks, et al. Outer-loop vectorization - revisited for short SIMD architectures, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[2] Pradeep Dubey, et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Richard W. Vuduc, et al. Direct N-body Kernels for Multicore Platforms, 2009, 2009 International Conference on Parallel Processing.
[4] Samuel Williams, et al. Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms, 2009, J. Parallel Distributed Comput.
[5] Trevor N. Mudge, et al. Power: A First-Class Architectural Design Constraint, 2001, Computer.
[6] Yi Yang, et al. A GPGPU compiler for memory optimization and parallelism management, 2010, PLDI '10.
[7] Guy E. Blelloch, et al. Low depth cache-oblivious algorithms, 2010, SPAA '10.
[8] Sverre J. Aarseth. Gravitational N-Body Simulations: Tools and Algorithms, 2003.
[9] Pradeep Dubey, et al. Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures, 2009, IEEE Transactions on Visualization and Computer Graphics.
[10] M. Knaup, et al. Hyperfast Perspective Cone-Beam Backprojection, 2006, 2006 IEEE Nuclear Science Symposium Conference Record.
[11] Leila Ismail, et al. Performance Evaluation of Convolution on the Cell Broadband Engine Processor, 2011, IEEE Transactions on Parallel and Distributed Systems.
[12] Pradeep Dubey, et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, 2010, ISCA.
[13] William J. Dally. The end of denial architecture and the rise of throughput computing, 2009, DAC 2009.
[14] M. Musiela, et al. The Market Model of Interest Rate Dynamics, 1997.
[15] Katherine Yelick, et al. Auto-tuning stencil codes for cache-based multicore platforms, 2009.
[16] Pradeep Dubey, et al. Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications, 2008, Proceedings of the IEEE.
[17] Uday Bondhugula, et al. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU, 2010.
[18] Mahmut T. Kandemir, et al. Cache topology aware computation mapping for multicores, 2010, PLDI '10.
[19] Samuel Williams, et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.
[20] Peng Wu, et al. Vectorization for SIMD architectures with alignment constraints, 2004, PLDI '04.
[21] Pradeep Dubey, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort, 2010, SIGMOD Conference.
[22] Hsien-Hsin S. Lee, et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth, 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[23] Pradeep Dubey, et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs, 2010, SIGMOD Conference.
[24] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[25] Andreas Polze, et al. Joint Forces: From Multithreaded Programming to GPU Computing, 2011, IEEE Software.
[26] Andrew W. Moore, et al. 'N-Body' Problems in Statistical Learning, 2000, NIPS.
[27] Edward T. Grochowski, et al. Larrabee: A many-Core x86 architecture for visual computing, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[28] Pradeep Dubey, et al. Efficient implementation of sorting on multi-core SIMD CPU architecture, 2008, Proc. VLDB Endow.
[29] William J. Dally, et al. A portable runtime interface for multi-level memory hierarchies, 2008, PPoPP.
[30] Michael C. Sukop, et al. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers, 2005.
[31] Ryan Newton, et al. A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops, 2011, IEEE Software.
[32] Jin Zhou, et al. Bamboo: a data-centric, object-oriented approach to many-core software, 2010, PLDI '10.
[33] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[34] Richard Henderson, et al. Multi-platform auto-vectorization, 2006, International Symposium on Code Generation and Optimization (CGO'06).
[35] F. Black, et al. The Pricing of Options and Corporate Liabilities, 1973, Journal of Political Economy.
[36] Michael B. Giles. Monte Carlo evaluation of sensitivities in computational finance, 2007.
[37] Rohit Chandra, et al. Parallel programming in OpenMP, 2000.