Enforce a Reliable Environment in Parallel Computing Applications

Abstract The importance of parallel computing is growing rapidly, and the need to perform complex computations in less time is the main driver of the technology's development. In almost every field of technology, parallel computing can save the time and effort needed to complete complex computational tasks. Modeling galaxy formation, climate change, and rush-hour traffic are real-life examples of problems that benefit from parallel computing compared with solving them in the traditional serial way. However, the scalability of computing resources is limited. The trend is therefore toward heterogeneous environments, which can provide more scalable resources. The main challenge then is to ensure reliability among the computing resources. In this work, we use the publish/subscribe model with quality of service (QoS) parameters provided by the Data Distribution Service (DDS) middleware, a standard developed by the Object Management Group (OMG).
