Assessing a mini‐application as a performance proxy for a finite element method engineering application

The performance of a large-scale, production-quality science and engineering application ('app') is often dominated by a small subset of the code. Even within that subset, computational and data-access patterns are often repeated, so an even smaller portion can represent the performance-impacting features. If application developers, parallel-computing experts, and computer architects can together identify this representative subset and then develop a small mini-application ('miniapp') that captures these primary performance characteristics, the miniapp can be used both to improve the performance of the app and to serve as a co-design tool for the high-performance computing community. A critical question, however, is whether a miniapp can effectively capture the key performance behavior of an app. This study compares an implicit finite element semiconductor device modeling app on unstructured meshes with an implicit finite element miniapp on unstructured meshes, with the goal of assessing whether the miniapp is predictive of the performance of the app. Single-compute-node performance is compared, as well as scaling up to 16,000 cores. Results indicate that the miniapp can be reasonably predictive of the performance characteristics of the app for a single iteration of the solver on a single compute node. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
