Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Pradeep Dubey | Rakesh Krishnaiyer | Hideki Saito | Mikhail Smelyanskiy | Nadathur Satish | Changkyu Kim | Jatin Chhugani | Milind B. Girkar
[1] Pat Hanrahan,et al. Volume Rendering , 2020, Definitions.
[2] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Richard W. Vuduc,et al. Direct N-body Kernels for Multicore Platforms , 2009, 2009 International Conference on Parallel Processing.
[4] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[5] Pradeep Dubey,et al. Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications , 2008, Proceedings of the IEEE.
[6] Rohit Chandra,et al. Parallel Programming in OpenMP , 2000 .
[7] Uday Bondhugula,et al. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU , 2010 .
[8] Mahmut T. Kandemir,et al. Cache topology aware computation mapping for multicores , 2010, PLDI '10.
[9] Michael B. Giles. Monte Carlo evaluation of sensitivities in computational finance , 2007 .
[10] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[11] Richard Henderson,et al. Multi-platform auto-vectorization , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[12] Michael C. Sukop,et al. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers , 2005 .
[13] Pradeep Dubey,et al. Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures , 2009, IEEE Transactions on Visualization and Computer Graphics.
[14] M. Knaup,et al. Hyperfast Perspective Cone-Beam Backprojection , 2006, 2006 IEEE Nuclear Science Symposium Conference Record.
[15] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[16] Pradeep Dubey,et al. Efficient implementation of sorting on multi-core SIMD CPU architecture , 2008, Proc. VLDB Endow..
[17] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.
[18] Guy E. Blelloch,et al. Low depth cache-oblivious algorithms , 2010, SPAA '10.
[19] Samuel Williams,et al. Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms , 2009, J. Parallel Distributed Comput..
[20] Trevor N. Mudge,et al. Power: A First-Class Architectural Design Constraint , 2001, Computer.
[21] Andreas Polze,et al. Joint Forces: From Multithreaded Programming to GPU Computing , 2011, IEEE Software.
[22] Andrew W. Moore,et al. 'N-Body' Problems in Statistical Learning , 2000, NIPS.
[23] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[24] Pradeep Dubey,et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort , 2010, SIGMOD Conference.
[25] Hsien-Hsin S. Lee,et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[26] Sverre J. Aarseth,et al. Gravitational N-Body Simulations , 2003 .
[27] Pradeep Dubey,et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs , 2010, SIGMOD Conference.
[28] Ryan Newton,et al. A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops , 2011, IEEE Software.
[29] Jin Zhou,et al. Bamboo: a data-centric, object-oriented approach to many-core software , 2010, PLDI '10.
[30] Leila Ismail,et al. Performance Evaluation of Convolution on the Cell Broadband Engine Processor , 2011, IEEE Transactions on Parallel and Distributed Systems.
[31] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.
[32] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.
[33] Pradeep Dubey,et al. Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology , 2012 .
[34] Sverre J. Aarseth. Gravitational N-Body Simulations: Tools and Algorithms , 2003 .
[35] M. Musiela,et al. The Market Model of Interest Rate Dynamics , 1997 .
[36] Katherine Yelick,et al. Auto-tuning stencil codes for cache-based multicore platforms , 2009 .
[37] Ayal Zaks,et al. Outer-loop vectorization - revisited for short SIMD architectures , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[38] F. Black,et al. The Pricing of Options and Corporate Liabilities , 1973, Journal of Political Economy.
[39] William J. Dally,et al. A portable runtime interface for multi-level memory hierarchies , 2008, PPoPP.
[40] William J. Dally. The end of denial architecture and the rise of throughput computing , 2009, DAC 2009.