Closing the Performance Gap with Modern C++

On the way to Exascale, programmers face the increasing challenge of having to support multiple hardware architectures from the same code base. At the same time, portability of code and performance is increasingly difficult to achieve as hardware architectures become more and more diverse. Today's heterogeneous systems often combine two or more completely distinct and incompatible hardware execution models, such as GPGPUs, SIMD vector units, and general-purpose cores, which conventionally have to be programmed using separate tool chains representing non-overlapping programming models. The recent revival of interest in the C++ language, both in industry and in the wider community, has spurred a remarkable number of standardization proposals and technical specifications in the arena of concurrency and parallelism. This includes a growing discussion around the need for a uniform, higher-level abstraction and programming model for parallelism in the C++ standard that targets heterogeneous and distributed computing. Such an abstraction should blend seamlessly with existing, already standardized language and library features, yet be generic enough to support future hardware developments. In this paper, we present the results of developing such a higher-level programming abstraction for parallelism in C++, which aims to enable code and performance portability across a wide range of architectures and for various types of parallelism. We present and compare performance data obtained from running the well-known STREAM benchmark ported to our higher-level C++ abstraction against the corresponding results from running it natively. We show that our abstractions enable performance at least as good as that of the comparable baseline benchmarks while providing a uniform programming API on all compared target architectures.

© Springer International Publishing AG 2016. M. Taufer et al. (Eds.): ISC High Performance Workshops 2016, LNCS 9945, pp. 18–31, 2016. DOI: 10.1007/978-3-319-46079-6_2
