Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Pradeep Dubey | Milind Girkar | Rakesh Krishnaiyer | Hideki Saito | Mikhail Smelyanskiy | Nadathur Satish | Changkyu Kim | Jatin Chhugani
[1] Michael B. Giles. Monte Carlo evaluation of sensitivities in computational finance, 2007.
[2] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[3] Ryan Newton,et al. A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops , 2011, IEEE Software.
[4] Pradeep Dubey,et al. Can traditional programming bridge the Ninja performance gap for parallel computing applications?, 2012, ISCA 2012.
[5] William J. Dally. The end of denial architecture and the rise of throughput computing , 2009, DAC 2009.
[6] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[7] Richard Henderson,et al. Multi-platform auto-vectorization , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[8] Jin Zhou,et al. Bamboo: a data-centric, object-oriented approach to many-core software , 2010, PLDI '10.
[9] F. Black,et al. The Pricing of Options and Corporate Liabilities , 1973, Journal of Political Economy.
[10] Andreas Polze,et al. Joint Forces: From Multithreaded Programming to GPU Computing , 2011, IEEE Software.
[11] Andrew W. Moore,et al. 'N-Body' Problems in Statistical Learning , 2000, NIPS.
[12] Leila Ismail,et al. Performance Evaluation of Convolution on the Cell Broadband Engine Processor , 2011, IEEE Transactions on Parallel and Distributed Systems.
[13] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.
[14] M. Musiela,et al. The Market Model of Interest Rate Dynamics, 1997.
[15] Katherine Yelick,et al. Auto-tuning stencil codes for cache-based multicore platforms, 2009.
[16] Martin Fowler,et al. Domain-Specific Languages , 2010, The Addison-Wesley signature series.
[17] Pradeep Dubey,et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort , 2010, SIGMOD Conference.
[18] Hsien-Hsin S. Lee,et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA-16: The Sixteenth International Symposium on High-Performance Computer Architecture.
[19] Pradeep Dubey,et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs , 2010, SIGMOD Conference.
[20] Samuel Williams,et al. Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms , 2009, J. Parallel Distributed Comput.
[21] Trevor N. Mudge,et al. Power: A First-Class Architectural Design Constraint , 2001, Computer.
[22] Edward T. Grochowski,et al. Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[23] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.
[24] Guy E. Blelloch,et al. Low depth cache-oblivious algorithms , 2010, SPAA '10.
[25] Milind Girkar,et al. Compiling C/C++ SIMD Extensions for Function and Loop Vectorizaion on Multicore-SIMD Processors , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[26] Pradeep Dubey,et al. Efficient implementation of sorting on multi-core SIMD CPU architecture , 2008, Proc. VLDB Endow..
[27] Michael C. Sukop,et al. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers, 2005.
[28] Pradeep Dubey,et al. Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology, 2012.
[29] Sverre J. Aarseth. Gravitational N-Body Simulations: Tools and Algorithms, 2003.
[30] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Richard W. Vuduc,et al. Direct N-body Kernels for Multicore Platforms , 2009, 2009 International Conference on Parallel Processing.
[32] Rohit Chandra,et al. Parallel programming in OpenMP, 2000.
[33] Ayal Zaks,et al. Outer-loop vectorization - revisited for short SIMD architectures , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[34] Pradeep Dubey,et al. Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications , 2008, Proceedings of the IEEE.
[35] Uday Bondhugula,et al. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU, 2010.
[36] Mahmut T. Kandemir,et al. Cache topology aware computation mapping for multicores , 2010, PLDI '10.
[37] Pradeep Dubey,et al. Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures , 2009, IEEE Transactions on Visualization and Computer Graphics.
[38] M. Knaup,et al. Hyperfast Perspective Cone-Beam Backprojection , 2006, 2006 IEEE Nuclear Science Symposium Conference Record.
[39] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.
[40] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[41] William J. Dally,et al. A portable runtime interface for multi-level memory hierarchies , 2008, PPoPP.