Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies

New generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economic constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, this work develops software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling and simulation with emerging multi-core technologies. A general model is developed to simulate atmospheric chemical transport and atmospheric chemical kinetics. The Cell Broadband Engine Architecture (CBEA), General Purpose Graphics Processing Units (GPGPUs), and homogeneous multi-core processors (e.g., Intel Quad-core Xeon) are introduced. These architectures are used in case studies of transport modeling and kinetics modeling, which demonstrate per-kernel speedups as high as 40×. A general analysis and code generation tool for chemical kinetics called "KPPA" is developed. KPPA generates highly tuned C, Fortran, or Matlab code that uses every layer of heterogeneous parallelism in the CBEA, GPGPU, and homogeneous multi-core architectures. A scalable method for simulating chemical transport is also developed. These methods are used to accelerate the Weather Research and Forecasting Model with Chemistry (WRF-Chem), with good results: real forecasts of air quality are generated for the Eastern United States 65% faster than the state-of-the-art models.

Dedication

To all the friends and family members who cheered me on and up. Particularly to my good friends and "adopted parents" Andrew and Denyse Sanderson. Without their generous support, kindness, and cups of tea this would not have been possible. And to my fiancée Katherine Wooten, who ran with me and fed me.
