PFunc: modern task parallelism for modern high performance computing

HPC today faces new challenges due to paradigm shifts in both hardware and software. The ubiquity of multi-cores, many-cores, and GPGPUs is forcing traditional serial as well as distributed-memory parallel applications to be parallelized for these architectures. Emerging applications in areas such as informatics place unique requirements on parallel programming tools that have not yet been addressed. Of all the available parallel programming models, task parallelism appears the most promising for meeting these challenges, yet current task-parallel solutions are inadequate. In this paper, we introduce PFunc, a new library for task parallelism that extends the feature set of current solutions with custom task scheduling, task priorities, task affinities, multiple completion notifications, and task groups. These features enable PFunc to naturally and efficiently parallelize a wide variety of modern HPC applications and to support the SPMD model of parallel programming. We present three case studies (demand-driven DAG execution, frequent pattern mining, and iterative sparse solvers) to demonstrate the utility of PFunc's new features.
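To make three of the listed features concrete (task priorities, task groups, and completion notifications that can reach more than one waiter), the following is a minimal Python sketch. It is not PFunc's API: PFunc is a separate library whose interface is not shown in this abstract, and every name here (`TaskGroup`, `PriorityPool`, `spawn`, `run_demo`) is a hypothetical stand-in chosen for illustration. The sketch assumes a single shared priority queue, which is only one of many possible scheduling policies.

```python
import itertools
import queue
import threading


class TaskGroup:
    """Hypothetical group handle: tracks outstanding tasks spawned into it."""

    def __init__(self):
        self._pending = 0
        self._cond = threading.Condition()

    def _task_added(self):
        with self._cond:
            self._pending += 1

    def _task_done(self):
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                # notify_all wakes every thread blocked in wait(), so several
                # observers can each be told that the group has completed.
                self._cond.notify_all()

    def wait(self):
        """Block until every task spawned into this group has finished."""
        with self._cond:
            while self._pending:
                self._cond.wait()


class PriorityPool:
    """Hypothetical worker pool that always runs the highest-priority task next."""

    def __init__(self, num_workers):
        self._queue = queue.PriorityQueue()
        # The ticket breaks ties so equal priorities run in spawn order and
        # the queue never has to compare groups or functions.
        self._ticket = itertools.count()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]

    def start(self):
        for worker in self._workers:
            worker.start()

    def spawn(self, group, priority, func, *args):
        group._task_added()
        # PriorityQueue pops the smallest item first, so negate the priority.
        self._queue.put((-priority, next(self._ticket), group, func, args))

    def _run(self):
        while True:
            _, _, group, func, args = self._queue.get()
            try:
                func(*args)
            finally:
                group._task_done()


def run_demo():
    """Spawn three tasks, then start one worker so the order is deterministic."""
    pool = PriorityPool(num_workers=1)
    group = TaskGroup()
    order = []
    for name, priority in [("low", 1), ("high", 10), ("mid", 5)]:
        pool.spawn(group, priority, order.append, name)
    pool.start()
    group.wait()
    return order


if __name__ == "__main__":
    print(run_demo())  # prints ['high', 'mid', 'low']
```

Because all three tasks are queued before the single worker starts, they run strictly in priority order. With more workers the same code still prefers higher priorities, but tasks may overlap, so the completion order is no longer guaranteed.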
