FRPA: A Framework for Recursive Parallel Algorithms

Abstract: Recursion continues to play an important role in high-performance computing. However, parallelizing recursive algorithms while achieving high performance is nontrivial and can result in complex, hard-to-maintain code. In particular, assigning processors to subproblems is complicated by recent observations that communication costs often dominate computation costs. Previous work [1], [3] demonstrates that carefully choosing which divide-and-conquer steps to execute in parallel (breadth-first steps) and which to execute sequentially (depth-first steps) can yield significant performance gains over naive scheduling. Our Framework for Recursive Parallel Algorithms (FRPA) separates an algorithm's implementation from its parallelization. The programmer simply defines how to split a problem, solve the base case, and merge solved subproblems; FRPA handles parallelizing the code and tuning the recursive parallelization strategy, enabling algorithms to achieve high performance. To demonstrate FRPA's performance capabilities, we present a detailed analysis of two algorithms: Strassen-Winograd [1] and Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication (CARMA) [3]. Our single-precision CARMA implementation is fewer than 80 lines of code and achieves a speedup of up to 11x over Intel's Math Kernel Library (MKL) [4] matrix multiplication routine on skinny matrices. Our double-precision Strassen-Winograd implementation, at just 150 lines of code, is up to 45% faster than MKL for large square matrix multiplications. To show FRPA's generality and simplicity, we implement six additional algorithms: mergesort, quicksort, TRSM, SYRK, Cholesky decomposition, and Delaunay triangulation [5]. FRPA is implemented in C++, runs in shared-memory environments, uses Intel's Cilk Plus [6] for task-based parallelism, and leverages OpenTuner [7] to tune the parallelization strategy.
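
To make the programming model concrete, the following minimal C++ sketch illustrates the split/base-case/merge interface described above, using mergesort (one of the implemented algorithms). The names here (Problem, mustRunBaseCase, runBaseCase, split, merge) and the sequential driver are illustrative assumptions rather than FRPA's actual API; the real framework executes subproblems as Cilk Plus tasks on breadth-first steps and sequentially on depth-first steps, as chosen by the autotuned parallelization strategy.

    // Illustrative sketch only: class and method names approximate the
    // interface described in the abstract; they are not FRPA's actual headers.
    #include <algorithm>
    #include <iostream>
    #include <vector>

    // The framework-facing interface: a recursive problem knows how to
    // split itself, solve its own base case, and merge solved subproblems.
    class Problem {
    public:
        virtual ~Problem() {}
        virtual bool mustRunBaseCase() const = 0;            // stop recursing?
        virtual void runBaseCase() = 0;                      // sequential leaf solve
        virtual std::vector<Problem*> split() = 0;           // create subproblems
        virtual void merge(std::vector<Problem*>& subs) = 0; // combine results
    };

    // A mergesort problem over a slice of an int array.
    class MergesortProblem : public Problem {
        int* data_;
        int n_;
    public:
        MergesortProblem(int* data, int n) : data_(data), n_(n) {}
        bool mustRunBaseCase() const override { return n_ <= 32; }
        void runBaseCase() override { std::sort(data_, data_ + n_); }
        std::vector<Problem*> split() override {
            int half = n_ / 2;
            return { new MergesortProblem(data_, half),
                     new MergesortProblem(data_ + half, n_ - half) };
        }
        void merge(std::vector<Problem*>& subs) override {
            // Both halves are already sorted in place; merge them.
            std::inplace_merge(data_, data_ + n_ / 2, data_ + n_);
            (void)subs;
        }
    };

    // Stand-in for the framework's recursive driver. The real framework
    // would spawn the subproblems as parallel tasks (e.g., via cilk_spawn)
    // on breadth-first steps and run them in sequence on depth-first steps.
    void solve(Problem* p) {
        if (p->mustRunBaseCase()) { p->runBaseCase(); return; }
        std::vector<Problem*> subs = p->split();
        for (Problem* s : subs) solve(s);
        p->merge(subs);
        for (Problem* s : subs) delete s;
    }

    int main() {
        std::vector<int> v = {9, 4, 7, 1, 8, 2, 6, 3, 5, 0};
        MergesortProblem root(v.data(), (int)v.size());
        solve(&root);
        for (int x : v) std::cout << x << ' ';
        std::cout << '\n';
    }

The point of such an interface is that all recursion management and scheduling lives in the driver, so changing the interleaving of breadth-first and depth-first steps requires no change to the algorithm code; this is the separation of implementation from parallelization that the abstract claims.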

[1] Shoaib Kamil et al. OpenTuner: An extensible framework for program autotuning. PACT, 2014.

[2] John Shalf et al. SEJITS: Getting productivity and performance with selective embedded JIT specialization. 2010.

[3] Scott Shenker et al. Spark: Cluster computing with working sets. HotCloud, 2010.

[4] Vijaya Ramachandran et al. Oblivious algorithms for multicores and network of processors. IPDPS, 2010.

[5] James Demmel et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication. SPAA, 2012.

[6] Helmar Burkhart et al. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. IPDPS, 2011.

[7] Richard Cole et al. Resource oblivious sorting on multicores. ICALP, 2010.

[8] James Demmel et al. Communication-optimal parallel recursive rectangular matrix multiplication. IPDPS, 2013.

[9] Geppino Pucci et al. Network-oblivious algorithms. IPDPS, 2007.

[10] Rosa M. Badia et al. CellSs: A programming model for the Cell BE architecture. SC, 2006.

[11] Shoaib Kamil et al. Bringing parallel performance to Python with domain-specific selective embedded just-in-time specialization. SciPy, 2011.

[12] Jack J. Dongarra et al. A portable programming interface for performance evaluation on modern processors. Int. J. High Perform. Comput. Appl., 2000.

[13] Charles E. Leiserson et al. Cache-oblivious algorithms. CIAC, 2003.

[14] James Demmel et al. Communication-avoiding parallel Strassen: Implementation and performance. SC, 2012.

[15] Alan Edelman et al. PetaBricks: A language and compiler for algorithmic choice. PLDI, 2009.

[16] Don Coppersmith et al. Matrix multiplication via arithmetic progressions. STOC, 1987.

[17] Jesús Labarta et al. A dependency-aware task-based programming environment for multi-core architectures. IEEE Cluster, 2008.

[18] Leonidas J. Guibas et al. Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams. STOC, 1983.

[19] Bruno Raffin et al. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. IPDPS, 2013.

[20] Guy E. Blelloch. Programming parallel algorithms. CACM, 1996.

[21] Bradley C. Kuszmaul et al. Cilk: An efficient multithreaded runtime system. PPoPP, 1995.