Reliable scalable symbolic computation: the design of SymGridPar2

Abstract Symbolic computation is an important area of both mathematics and computer science, with many large computations that would benefit from parallel execution. Symbolic computations are, however, challenging to parallelise: they have complex data and control structures, and their parallelism is both dynamic and highly irregular. The SymGridPar framework (SGP) was developed to address these challenges on small-scale parallel architectures. However, the multicore revolution means that the number of cores, and hence the number of failures, is growing exponentially, and that communication topologies are becoming increasingly complex. An improved parallel symbolic computation framework is therefore required. This paper presents the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar designed to scale onto 10^5 cores and, hence, also to tolerate faults. We present the SGP2 design goals, principles and architecture. We describe how scalability is achieved using layering and by allowing the programmer to control task placement. We outline how fault tolerance is provided by supervising remote computations, and sketch higher-level fault-tolerance abstractions. We describe the SGP2 implementation status and development plans. Finally, we report scalability and efficiency results, including weak scaling to about 32,000 cores, and investigate the overheads of tolerating faults for simple symbolic computations.
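The supervision of remote computations mentioned in the abstract can be illustrated with a minimal, hedged sketch. This is not the SGP2/HdpH API: `supervise`, `attempt`, and the flaky task below are hypothetical names, and the "remote" computation is simulated by a local IO action that may throw an exception. The sketch shows only the core pattern a supervisor applies: run a task, detect failure, and re-schedule it up to a bounded number of retries.

```haskell
import Control.Exception (SomeException, ErrorCall (..), evaluate, throwIO, try)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Run a task, forcing its result and catching any exception.
-- (A real supervisor would instead detect the death of a remote node.)
attempt :: IO a -> IO (Either SomeException a)
attempt act = try (act >>= evaluate)

-- Supervise a task: on failure, retry up to n times before giving up.
supervise :: Int -> IO a -> IO (Either String a)
supervise 0 _    = return (Left "retries exhausted")
supervise n task = do
  r <- attempt task
  case r of
    Right v -> return (Right v)
    Left _  -> supervise (n - 1) task  -- re-schedule the failed task

main :: IO ()
main = do
  -- A flaky task: fails twice (simulating lost workers), then succeeds.
  countRef <- newIORef (0 :: Int)
  let flaky = do
        c <- readIORef countRef
        writeIORef countRef (c + 1)
        if c < 2 then throwIO (ErrorCall "node died") else return (6 * 7 :: Int)
  r <- supervise 5 flaky
  print r  -- prints: Right 42
```

In SGP2 itself the supervised unit is a serialised task placed on a remote node, and the supervisor replicates the task's closure so it can be re-sent elsewhere; the bounded-retry structure, however, is the same.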
