Performance of MD-Algorithms on Hybrid Systems-on-Chip Nvidia Tegra K1 & X1

In this paper we consider the efficiency of hybrid systems-on-a-chip for high-performance calculations. First, we build Roofline performance models for the systems considered using the Empirical Roofline Toolkit and compare the results with theoretical estimates. Second, we use LAMMPS as an example of a molecular dynamics package and demonstrate its performance and efficiency in various configurations on the Nvidia Tegra K1 and X1. Following the Roofline approach, we attempt to distinguish compute-bound and memory-bound conditions for the MD algorithm using the Lennard-Jones liquid model. The results are discussed in the context of LAMMPS performance on Intel Xeon CPUs and the Nvidia Tesla K80 GPU.
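As a rough illustration of the Roofline approach mentioned above, the sketch below computes the attainable performance bound as the minimum of the peak compute rate and the product of memory bandwidth and arithmetic intensity, and classifies a kernel as compute-bound or memory-bound relative to the ridge point. The function names and all numeric values are hypothetical placeholders chosen for readability. They are not measurements of the Tegra K1 or X1 and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Roofline bound in GFLOP/s.

    arithmetic_intensity is in FLOP per byte of memory traffic.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def classify(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Label a kernel by which roof limits it in the model."""
    if arithmetic_intensity >= ridge_point(peak_gflops, bandwidth_gbs):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
    peak, bandwidth = 100.0, 10.0
    for ai in (0.5, 2.0, 10.0, 50.0):
        bound = attainable_gflops(peak, bandwidth, ai)
        label = classify(peak, bandwidth, ai)
        print(f"AI = {ai:5.1f} FLOP/byte: bound = {bound:6.1f} GFLOP/s ({label})")
```

In practice the peak and bandwidth inputs would come from a measurement tool such as the Empirical Roofline Toolkit or from vendor specifications, and the arithmetic intensity of the MD kernel would have to be estimated separately.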
