High Performance Computing: 35th International Conference, ISC High Performance 2020, Frankfurt/Main, Germany, June 22–25, 2020, Proceedings

The hash table is a fundamental data structure that provides efficient data storage and access. It is a key component in AI applications that build a model of the environment from observations and perform lookups on that model for newer observations. In this work, we develop FASTHash, a “truly” high-throughput parallel hash table implementation using FPGA on-chip SRAM. In contrast to state-of-the-art hash table implementations on CPU, GPU, and FPGA, the parallelism in our design is data independent, allowing us to support p parallel queries (p > 1) per clock cycle via p processing engines (PEs) in the worst case. Our novel data organization and query flow techniques allow full utilization of abundant low-latency on-chip SRAM and enable conflict-free concurrent insertions. Our hash table ensures relaxed eventual consistency: inserts from a PE are visible to all PEs with some latency. We provide a theoretical worst-case bound on the number of erroneous queries (true negative search, duplicate inserts) due to relaxed eventual consistency. We customize our design to implement both static and dynamic hash tables on state-of-the-art FPGA devices. Our implementations scale to 16 PEs and support throughput as high as 5360 million operations per second with PEs running at 335 MHz for static hashing, and 4480 million operations per second with PEs running at 280 MHz for dynamic hashing. They outperform state-of-the-art implementations by 5.7x and 8.7x, respectively.
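The reported throughput figures follow from the claim of p queries per clock cycle. As a minimal sketch, assuming every PE completes exactly one operation per cycle (the function name `peak_throughput_mops` is hypothetical and not taken from the paper), the arithmetic can be checked as follows:

```python
def peak_throughput_mops(num_pes: int, clock_mhz: float) -> float:
    """Peak throughput in million operations per second (MOPS).

    Assumes each processing engine (PE) retires one operation per
    clock cycle, so throughput = number of PEs x clock frequency.
    """
    return num_pes * clock_mhz


# Figures quoted in the abstract: 16 PEs at 335 MHz (static hashing)
# and 16 PEs at 280 MHz (dynamic hashing).
static_mops = peak_throughput_mops(16, 335)
dynamic_mops = peak_throughput_mops(16, 280)

print(static_mops)   # 5360
print(dynamic_mops)  # 4480
```

Both results match the numbers stated above, which is consistent with the data-independent parallelism claim: the peak rate depends only on the PE count and the clock frequency, not on the query pattern.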


[227]  Marc Gendron-Bellemare,et al.  Fast, Scalable Algorithms for Reinforcement Learning in High Dimensional Domains , 2013 .

[228]  P. Yla-Oijala,et al.  Singularity subtraction technique for high-order polynomial vector basis functions on planar triangles , 2006, IEEE Transactions on Antennas and Propagation.

[229]  Shengen Yan,et al.  Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes , 2019, ArXiv.

[230]  Matthew Poremba,et al.  Design and Analysis of an APU for Exascale Computing , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[231]  Nathan Halko,et al.  Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , 2009, SIAM Rev..

[232]  Talita Perciano,et al.  DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives , 2018, 2018 IEEE 8th Symposium on Large Data Analysis and Visualization (LDAV).

[233]  Robert A. van de Geijn,et al.  Programming matrix algorithms-by-blocks for thread-level parallelism , 2009, TOMS.

[234]  Ryan E. Grant,et al.  Fuzzy Matching: Hardware Accelerated MPI Communication Middleware , 2019, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[235]  Alex Brooks,et al.  Argobots: A Lightweight Low-Level Threading and Tasking Framework , 2018, IEEE Transactions on Parallel and Distributed Systems.

[236]  Torsten Hoefler,et al.  Kernel-Based Offload of Collective Operations - Implementation, Evaluation and Lessons Learned , 2011, Euro-Par.

[237]  Kerstin Kleese van Dam,et al.  Management, analysis, and visualization of experimental and observational data — The convergence of data and computing , 2016, 2016 IEEE 12th International Conference on e-Science (e-Science).

[238]  Manish Parashar,et al.  Local recovery and failure masking for stencil-based applications at extreme scales , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[239]  Talita Perciano,et al.  Distributed memory parallel Markov random fields using graph partitioning , 2017, 2017 IEEE International Conference on Big Data (Big Data).

[240]  John D. Owens,et al.  A Dynamic Hash Table for the GPU , 2017, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[241]  Jannis Klinkenberg,et al.  Hybrid MPI+OpenMP Reactive Work Stealing in Distributed Memory in the PDE Framework sam(oa)^2 , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).

[242]  Julian M. Kunkel,et al.  Tracking User-Perceived I/O Slowdown via Probing , 2019, ISC Workshops.

[243]  Richard E. Ewing,et al.  High-Precision BLAS on FPGA-enhanced Computers , 2007, ERSA.

[244]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[245]  Gerhard Wellein,et al.  Delay Flow Mechanisms on Clusters , 2019 .

[246]  L. Chua Memristor-The missing circuit element , 1971 .

[247]  Zizhong Chen,et al.  Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing , 2005, Int. J. High Perform. Comput. Appl..

[248]  Gustavo Alonso,et al.  Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited , 2013, Proc. VLDB Endow..

[249]  Nicholas J. Higham,et al.  Matlab guide , 2000 .

[250]  Raymond Namyst,et al.  MPC: A Unified Parallel Runtime for Clusters of NUMA Machines , 2008, Euro-Par.

[251]  Ahmad Afsahi,et al.  Communication‐aware message matching in MPI , 2018, Concurr. Comput. Pract. Exp..

[252]  Bernhard Scholz,et al.  Soufflé: On Synthesis of Program Analyzers , 2016, CAV.

[253]  Michael E. Papka,et al.  Optimal Execution of Co-analysis for Large-Scale Molecular Dynamics Simulations , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[254]  Jason Duell,et al.  Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .

[255]  Seth Lloyd,et al.  Adiabatic quantum computation is equivalent to standard quantum computation , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[256]  Gerhard Wellein,et al.  Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study , 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER).

[257]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[258]  Mikhail L. Zymbler,et al.  Time Series Subsequence Similarity Search Under Dynamic Time Warping Distance on the Intel Many-core Accelerators , 2015, SISAP.

[259]  Richard Barrett,et al.  Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.

[260]  Tian Jin,et al.  Efficient Fork-Join on GPUs Through Warp Specialization , 2017, 2017 IEEE 24th International Conference on High Performance Computing (HiPC).

[261]  Karthick Rajamani,et al.  A performance-conserving approach for reducing peak power consumption in server systems , 2005, ICS '05.

[262]  James Bremer,et al.  A Nyström method for weakly singular integral operators on surfaces , 2012, J. Comput. Phys..

[263]  Felix Wolf,et al.  Understanding the Scalability of Molecular Simulation Using Empirical Performance Modeling , 2018, ESPT/VPA@SC.

[264]  Alex Yu. Yeremin,et al.  Matrix-free iterative solution strategies for large dense linear systems , 1997, Numer. Linear Algebra Appl..

[265]  Nan Ding,et al.  An Instruction Roofline Model for GPUs , 2019, 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).

[266]  Pavan Balaji,et al.  Lessons Learned Implementing User-Level Failure Mitigation in MPICH , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[267]  Keith D. Underwood,et al.  Evaluation of an Eager Protocol Optimization for MPI , 2003, PVM/MPI.

[268]  Robin Thomas,et al.  On the complexity of finding iso- and other morphisms for partial k-trees , 1992, Discret. Math..

[269]  Serge Abiteboul,et al.  Foundations of Databases: The Logical Level , 1995 .

[270]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[271]  Gerhard Wellein,et al.  High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations , 2015, J. Comput. Phys..

[272]  D. Quinlan,et al.  ROSE: Compiler Support for Object-Oriented Frameworks , 1999, Parallel Process. Lett..

[273]  John Shalf,et al.  The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..

[274]  Dmitry Pekurovsky,et al.  P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions , 2012, SIAM J. Sci. Comput..

[275]  Lizy Kurian John,et al.  Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[276]  Eamonn J. Keogh,et al.  Scaling Time Series Motif Discovery with GPUs : Breaking the Quintillion Pairwise Comparisons a Day Barrier , 2018 .

[277]  Y. Saad,et al.  GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .

[278]  Michael Bader,et al.  Influence of A-Posteriori Subcell Limiting on Fault Frequency in Higher-Order DG Schemes , 2018, 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).

[279]  Alejandro Duran,et al.  YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).

[280]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[281]  Steven L Salzberg,et al.  Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype , 2019, Nature Biotechnology.

[282]  Goran Flegar,et al.  Overcoming Load Imbalance for Irregular Sparse Matrices , 2017, IA3@SC.

[283]  Bronis R. de Supinski,et al.  Exascale Algorithms for Generalized MPI_Comm_split , 2011, EuroMPI.

[284]  Travis S. Humble,et al.  Adiabatic quantum programming: minor embedding with hard faults , 2012, Quantum Information Processing.

[285]  Pongsakorn U.-Chupala,et al.  ImageNet/ResNet-50 Training in 224 Seconds , 2018, ArXiv.

[286]  Thomas Rauber,et al.  Applicability of the ECM Performance Model to Explicit ODE Methods on Current Multi-core Processors , 2018, ISC.

[287]  Hari Subramoni,et al.  Design and Characterization of InfiniBand Hardware Tag Matching in MPI , 2020, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).

[288]  David E. Keyes,et al.  Tile Low Rank Cholesky Factorization for Climate/Weather Modeling Applications on Manycore Architectures , 2017, ISC.

[289]  Kenneth Moreland,et al.  Techniques for data-parallel searching for duplicate elements , 2017, 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV).

[290]  Giuseppe Di Fatta,et al.  Scalable and Fault Tolerant Failure Detection and Consensus , 2015, EuroMPI.

[291]  Matthias Ganzinger,et al.  Alignment of High-Throughput Sequencing Data Inside In-Memory Databases , 2014, MIE.

[292]  Hristo Djidjev,et al.  Solving large minimum vertex cover problems on a quantum annealer , 2019, CF.

[293]  Bingsheng He,et al.  Efficient gather and scatter operations on graphics processors , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[294]  Vicky Choi,et al.  Minor-embedding in adiabatic quantum computation: II. Minor-universal graph design , 2010, Quantum Inf. Process..

[295]  Nikhil R. Devanur,et al.  PipeDream: Fast and Efficient Pipeline Parallel DNN Training , 2018, ArXiv.

[296]  Hiroshi Nakamura,et al.  Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[297]  A. Y. Suhov An Accurate Polynomial Approximation of Exponential Integrators , 2014, J. Sci. Comput..

[298]  Aaron Vose,et al.  Programming for Hybrid Multi/Manycore MPP Systems , 2017 .

[299]  Stephan Eidenbenz,et al.  Deterministic Preparation of Dicke States , 2019, FCT.

[300]  Emina Torlak,et al.  Kodkod: A Relational Model Finder , 2007, TACAS.

[301]  Hartwig Anzt,et al.  Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On , 2020, ISC.

[302]  Axel Auweter,et al.  From facility to application sensor data: modular, continuous and holistic monitoring with DCDB , 2019, SC.

[303]  Jacob Hemstad,et al.  ISx: A Scalable Integer Sort for Co-design in the Exascale Era , 2015, 2015 9th International Conference on Partitioned Global Address Space Programming Models.

[304]  Michael Garland,et al.  Optimizing Sparse Matrix Operations on GPUs Using Merge Path , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[305]  Till Westmann,et al.  On fast large-scale program analysis in Datalog , 2016, CC.

[306]  Laxmikant V. Kalé,et al.  NAMD: Biomolecular Simulation on Thousands of Processors , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[307]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[308]  Andreas Eckner,et al.  Algorithms for Unevenly Spaced Time Series : Moving Averages and Other Rolling Operators , 2015 .

[309]  David Ozog,et al.  Lightweight Instrumentation and Analysis Using OpenSHMEM Performance Counters , 2018, OpenSHMEM.

[310]  Matthias Becker,et al.  Accelerated Genomics Data Processing using Memory-Driven Computing , 2019, 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[311]  Philipp Samfass,et al.  Lightweight task offloading exploiting MPI wait times for parallel adaptive mesh refinement , 2020, Concurr. Comput. Pract. Exp..

[312]  Onur Mutlu,et al.  Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[313]  Jeffrey S. Vetter,et al.  NVIDIA Tensor Core Programmability, Performance & Precision , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[314]  Michael Dumbser,et al.  ExaHyPE: An Engine for Parallel Dynamically Adaptive Simulations of Wave Problems , 2019, Comput. Phys. Commun..

[315]  Martin Schulz,et al.  Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[316]  David E. Keyes,et al.  KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators , 2014, ACM Trans. Math. Softw..

[317]  Karsten Schwan,et al.  I/O Containers: Managing the Data Analytics and Visualization Pipelines of High End Codes , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[318]  Yuri Boykov,et al.  A Scalable graph-cut algorithm for N-D grids , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[319]  Robert Latham,et al.  Understanding and improving computational science storage access through continuous characterization , 2011, 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST).

[320]  Torsten Hoefler,et al.  Message progression in parallel computing - to thread or not to thread? , 2008, 2008 IEEE International Conference on Cluster Computing.

[321]  S. M. Ghazimirsaeed,et al.  Accelerating MPI Message Matching by a Data Clustering Strategy , 2017 .

[322]  Emmanuel Agullo,et al.  Comparative study of one-sided factorizations with multiple software packages on multi-core hardware , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[323]  Dean M. Tullsen,et al.  Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.

[324]  Jaejin Lee,et al.  Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems , 2009, IEEE Transactions on Parallel and Distributed Systems.

[325]  Eric S. Chung,et al.  Towards a Universal FPGA Matrix-Vector Multiplication Architecture , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[326]  Thomas Hérault,et al.  Failure Detection and Propagation in HPC systems , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[327]  Vicky Choi,et al.  Minor-embedding in adiabatic quantum computation: I. The parameter setting problem , 2008, Quantum Inf. Process..

[328]  J. Parrón,et al.  Multiscale Compressed Block Decomposition for Fast Direct Solution of Method of Moments Linear System , 2011, IEEE Transactions on Antennas and Propagation.

[329]  Nicolai M. Josuttis The C++ Standard Library: A Tutorial and Reference , 2012 .

[330]  Thomas Gilray,et al.  Distributed Relational Algebra at Scale , 2019, 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC).

[331]  Franck Cappello,et al.  Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..

[332]  Michael Garland,et al.  Merge-Based Parallel Sparse Matrix-Vector Multiplication , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[333]  Hiroshi Nakamura,et al.  Immediate sleep: Reducing energy impact of peripheral circuits in STT-MRAM caches , 2015, 2015 33rd IEEE International Conference on Computer Design (ICCD).

[334]  Dong Li,et al.  Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[335]  Dhabaleswar K. Panda,et al.  OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training , 2018, 2018 IEEE 25th International Conference on High Performance Computing (HiPC).

[336]  Arie Shoshani,et al.  Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks , 2014, Concurr. Comput. Pract. Exp..

[337]  David W. Walker,et al.  Performance analysis of a hybrid MPI/OpenMP application on multi-core clusters , 2010, J. Comput. Sci..

[338]  William J. Dally,et al.  Architecting an Energy-Efficient DRAM System for GPUs , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[339]  Gerhard Wellein,et al.  Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model , 2014, ICS.

[340]  Thomas Hérault,et al.  Sliding Substitution of Failed Nodes , 2015, EuroMPI.

[341]  Jack Dongarra,et al.  On block-asynchronous execution on GPUs , 2016 .

[342]  Sayantan Sur,et al.  A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency , 2015, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects.

[343]  Tak-Chung Fu,et al.  A review on time series data mining , 2011, Eng. Appl. Artif. Intell..

[344]  Paul Fischer,et al.  PROJECTION TECHNIQUES FOR ITERATIVE SOLUTION OF Ax = b WITH SUCCESSIVE RIGHT-HAND SIDES , 1993 .

[345]  Jorge Luis Rodriguez,et al.  The Open Science Grid , 2005 .

[346]  Dhabaleswar K. Panda,et al.  Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication , 2017, ISC.

[347]  Thomas Hérault,et al.  Design for a Soft Error Resilient Dynamic Task-Based Runtime , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[348]  Adam Moody,et al.  System Noise Revisited: Enabling Application Scalability and Reproducibility with SMT , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[349]  Jesús Labarta,et al.  Improving the Interoperability between MPI and Task-Based Programming Models , 2018, EuroMPI.

[350]  A. Schukraft,et al.  The IceCube Neutrino Observatory: Instrumentation and Online Systems , 2016, 1612.05093.

[351]  Pavan Balaji,et al.  A Review of Lightweight Thread Approaches for High Performance Computing , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[352]  J. Demmel,et al.  Sun Microsystems , 1996 .

[353]  Julian M. Kunkel,et al.  Footprinting Parallel I/O - Machine Learning to Classify Application's I/O Behavior , 2019, ISC Workshops.

[354]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[355]  Mehmet Akif Ersoy,et al.  Parallelizing shortest path algorithm for time dependent graphs with flow speed model , 2016, 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT).

[356]  Fabian J. Theis,et al.  Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks , 2016, bioRxiv.

[357]  Martin Schulz,et al.  Practical Resource Management in Power-Constrained, High Performance Computing , 2015, HPDC.

[358]  Sunggu Lee,et al.  Power management of hybrid DRAM/PRAM-based main memory , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[359]  Denis Trystram,et al.  Improving backfilling by using machine learning to predict running times , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[360]  Chris J. Scheiman,et al.  LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.

[361]  Akinori Yonezawa,et al.  StackThreads/MP: integrating futures into calling standards , 1999, PPoPP '99.

[362]  Hiroshi Nakamura,et al.  7.2 4Mb STT-MRAM-based cache with memory-access-aware power optimization and write-verify-write / read-modify-write scheme , 2016, 2016 IEEE International Solid-State Circuits Conference (ISSCC).

[363]  Sven Rahmann,et al.  Genome analysis , 2022 .

[364]  Thomas F. Wenisch,et al.  CoScale: Coordinating CPU and Memory System DVFS in Server Systems , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[365]  Nicholas J. Higham,et al.  Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions , 2018, SIAM J. Sci. Comput..

[366]  Stewart Taylor,et al.  Optimizing Applications for Multi-Core Processors, Using the Intel® Integrated Performance Primitives, Second Edition , 2007 .

[367]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[368]  Dhabaleswar K. Panda,et al.  High performance and reliable NIC-based multicast over Myrinet/GM-2 , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[369]  Gerhard Wellein,et al.  Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs , 2020, ISC.

[370]  Emmanuel Jeannot,et al.  Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model , 2019, IEEE Transactions on Parallel and Distributed Systems.

[371]  G. F. Miller,et al.  The application of integral equation methods to the numerical solution of some exterior boundary-value problems , 1971, Proceedings of the Royal Society of London. A. Mathematical and Physical Sciences.

[372]  Krishnamurthy Viswanathan,et al.  Billion node graph inference : iterative processing on The Machine , 2017 .

[373]  Martin Schulz,et al.  Exploring hardware overprovisioning in power-constrained, high performance computing , 2013, ICS '13.

[374]  James Demmel,et al.  Precimonious: Tuning assistant for floating-point precision , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[375]  David E. Keyes,et al.  High Performance Pseudo-analytical Simulation of Multi-Object Adaptive Optics over Multi-GPU Systems , 2014, Euro-Par.

[376]  Wolfgang Hackbusch,et al.  A Sparse Matrix Arithmetic Based on H-Matrices. Part I: Introduction to H-Matrices , 1999, Computing.

[377]  Courtenay T. Vaughan,et al.  ASC Tri-lab Co-design Level 2 Milestone Report 2015 , 2015 .

[378]  Gerhard Wellein,et al.  likwid-bench: An Extensible Microbenchmarking Platform for x86 Multicore Compute Nodes , 2011, Parallel Tools Workshop.

[379]  Stanimire Tomov,et al.  Load-balancing Sparse Matrix Vector Product Kernels on GPUs , 2020, ACM Trans. Parallel Comput..

[380]  Dhabaleswar K. Panda,et al.  System-Level Scalable Checkpoint-Restart for Petascale Computing , 2016, 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS).

[381]  Harvey Richardson,et al.  LASSI: Metric Based I/O Analytics for HPC , 2019, 2019 Spring Simulation Conference (SpringSim).

[382]  Bernhard Scholz,et al.  A specialized B-tree for concurrent datalog evaluation , 2019, PPoPP.

[383]  Bernhard Scholz,et al.  Brie: A Specialized Trie for Concurrent Datalog , 2019, PMAM@PPoPP.

[384]  Rolf Riesen,et al.  See applications run and throughput jump: The case for redundant computing in HPC , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).

[385]  J Balcas,et al.  Pushing HTCondor and glideinWMS to 200K+ Jobs in a Global Pool for CMS before Run 2 , 2015 .

[386]  Michael Stumm,et al.  BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[387]  Eduard Ayguadé,et al.  Overlapping communication and computation by using a hybrid MPI/SMPSs approach , 2010, ICS '10.

[388]  Giuseppe Di Fatta,et al.  Epidemic failure detection and consensus for extreme parallelism , 2018, Int. J. High Perform. Comput. Appl..

[389]  John David Funge Artificial Intelligence for Computer Games: An Introduction , 2004 .

[390]  Douglas Thain,et al.  Qthreads: An API for programming with millions of lightweight threads , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[391]  Christian Engelmann,et al.  The Case for Modular Redundancy in Large-Scale High Performance Computing Systems , 2009 .

[392]  Aidan Roy,et al.  A practical heuristic for finding graph minors , 2014, ArXiv.

[393]  James Dinan,et al.  Contexts: A Mechanism for High Throughput Communication in OpenSHMEM , 2014, PGAS.

[394]  Travis S. Humble,et al.  Optimizing adiabatic quantum program compilation using a graph-theoretic framework , 2017, Quantum Information Processing.

[395]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[396]  Dhabaleswar K. Panda,et al.  A case for application-oblivious energy-efficient MPI runtime , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[397]  Jeffrey S. Vetter,et al.  Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing , 2015, Computing in Science & Engineering.

[398]  Jun Wang,et al.  Discovering Multidimensional Motifs in Physiological Signals for Personalized Healthcare , 2016, IEEE Journal of Selected Topics in Signal Processing.

[399]  Masanori Bando,et al.  FlashLook: 100-Gbps hash-tuned route lookup architecture , 2009, 2009 International Conference on High Performance Switching and Routing.

[400]  Stephen A. Jarvis,et al.  CloverLeaf: Preparing Hydrodynamics Codes for Exascale , 2013 .

[401]  Stephen L. Olivier,et al.  Optimizing for KNL Usage Modes When Data Doesn't Fit in MCDRAM , 2018, ICPP.

[402]  J. Shaeffer,et al.  Direct Solve of Electrically Large Integral Equations for Problem Sizes to 1 M Unknowns , 2008, IEEE Transactions on Antennas and Propagation.

[403]  Ryan E. Grant,et al.  A dynamic, unified design for dedicated message matching engines for collective and point-to-point communications , 2019, Parallel Comput..

[404]  David E. Keyes,et al.  Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering , 2018, SIAM J. Sci. Comput..

[405]  Chao Yang,et al.  CAMERA: The Center for Advanced Mathematics for Energy Research Applications , 2015 .

[406]  Sharad Singhal,et al.  Adapting to Thrive in a New Economy of Memory Abundance , 2015, Computer.

[407]  Hank Childs,et al.  In Situ Visualization for Computational Science , 2019, IEEE Computer Graphics and Applications.

[408]  L FredmanMichael,et al.  Storing a Sparse Table with 0(1) Worst Case Access Time , 1984 .

[409]  Francesca Mazzia,et al.  Test Set for Initial Value Problem Solvers , 2003 .

[410]  Barbara M. Chapman,et al.  Towards Automatic HBM Allocation Using LLVM: A Case Study with Knights Landing , 2016, 2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC).

[411]  James H. Laros,et al.  Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[412]  Frank Wuerthwein,et al.  Characterizing network paths in and out of the clouds , 2020, ArXiv.

[413]  Talita Perciano,et al.  Maximal clique enumeration with data-parallel primitives , 2017, 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV).

[414]  Anthony Di Franco,et al.  A comprehensive study of real-world numerical bug characteristics , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[415]  A. Sevin,et al.  A novel fast and accurate pseudo-analytical simulation approach for MOAO , 2014, Astronomical Telescopes and Instrumentation.
