Towards Developing a Repository of Logical Errors Observed in Parallel Code for Teaching Code Correctness

Debugging parallel programs can be challenging, especially for beginners. Debuggers such as DDT and TotalView are extremely useful for tracking down the program statements connected to a bug, but the onus is often on the programmer to reason about the logic of those statements in order to fix it. These debuggers may neither pinpoint the logical errors in a parallel program nor provide information on how to fix them. There is therefore a need for tools and educational content that teach the pitfalls of parallel programming and how to write correct code. Such content can guide beginners in avoiding commonly observed logical errors and in verifying the correctness of their parallel programs. In this paper, we 1) enumerate some of the logical errors that we have seen in parallel programs (OpenMP, MPI, and CUDA) written by beginners working with us, and 2) discuss ways to fix those errors. The errors are mainly related to data distribution, exiting distributed for-loops, and workload imbalance. Documentation of these logical errors can help improve the productivity of beginners and can potentially help them in their debugging efforts. We have added the code samples containing the logical errors, along with their solutions, to a GitHub repository so that others in the community can reproduce the errors on their own systems and learn from them. The content presented in this paper may also be useful to those developing high-level tools for detecting and removing logical errors in parallel programs.
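As an illustration of the data-distribution category mentioned above, the following is a minimal sketch in C of one plausible error of this kind; it is not taken from the paper's repository. It assumes a block distribution of n loop iterations over nprocs processes or threads. The names block_range, block_range_truncating, and block_range_balanced are hypothetical helpers introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: half-open range [begin, end) of the
 * iterations owned by one rank (process or thread). */
typedef struct {
    size_t begin;
    size_t end;
} block_range;

/* Erroneous version: every rank gets n / nprocs iterations, so the
 * n % nprocs leftover iterations are never assigned to any rank. */
static block_range block_range_truncating(size_t n, int nprocs, int rank) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t chunk = n / (size_t)nprocs;
    size_t begin = (size_t)rank * chunk;
    block_range r = { begin, begin + chunk };
    return r;
}

/* Corrected version: the first n % nprocs ranks each take one extra
 * iteration, so all n iterations are covered exactly once and the
 * per-rank counts differ by at most one. */
static block_range block_range_balanced(size_t n, int nprocs, int rank) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t p = (size_t)nprocs;
    size_t k = (size_t)rank;
    size_t chunk = n / p;
    size_t extra = n % p;
    size_t begin = k * chunk + (k < extra ? k : extra);
    size_t len = chunk + (k < extra ? 1 : 0);
    block_range r = { begin, begin + len };
    return r;
}
```

With n = 10 and nprocs = 4, the truncating version assigns only iterations 0 through 7 and silently skips the last two, whereas the balanced version yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10). A common alternative fix is to give all leftover iterations to the last rank, which restores correctness but concentrates the extra work on one rank.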
