What Is Wrong with the Transmission? A Comprehensive Study on Message Passing Related Bugs

As distributed systems become prevalent, more and more applications need to transfer messages reliably across a network. However, passing messages in a convenient and dependable way is both difficult and error-prone, so existing messaging products usually suffer from numerous software bugs, and these bugs are particularly difficult to diagnose or avoid. To improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge) comprehensive characteristic study of message passing related bugs (MP-bugs). We have carefully examined the pattern, manifestation, fixing, and other characteristics of 349 randomly selected real-world MP-bugs from 3 representative open-source applications (Open MPI, ZeroMQ, and ActiveMQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorised into two simple patterns, message level bugs and connection level bugs, which suggests promising prospects for tools that detect or tolerate MP-bugs. Apart from this finding, our study has also uncovered many new (and sometimes surprising) insights into the development process of message passing systems. The results should be useful for the design of corresponding bug detecting, exposing, and tolerating tools.
