Detecting anomalies in high-performance parallel programs

Message passing interface (MPI) is an effective programming technique for implementing parallel programs for distributed computation. As these applications run, a number of different types of irregularities can occur including those that result from intrusions, user misbehavior, corrupted data, deadlocks or failure of cluster components. We perform a comparison of different artificial intelligence (AI) techniques that can be used to implement a lightweight monitoring and detection system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of function library and OS system calls of parallel programs written with MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a 0% false positive rate in real-time, and we show that the added computational load on each node is small. Finally we demonstrate that simple deterministic methods perform poorly when the program flow grows in size and variety, and that more complex methods are required.

[1]  Zhen Liu,et al.  Attacking a High Performance Computer Cluster , 2004 .

[2]  Barak A. Pearlmutter,et al.  Detecting intrusions using system calls: alternative data models , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[3]  Stephanie Forrest,et al.  Operating system stability and security through process homeostasis , 2002 .

[4]  Stephanie Forrest,et al.  A sense of self for Unix processes , 1996, Proceedings 1996 IEEE Symposium on Security and Privacy.

[5]  James Coyle,et al.  Deadlock detection in MPI programs , 2002, Concurr. Comput. Pract. Exp..

[6]  Philippe Augerat,et al.  Scalable monitoring and configuration tools for grids and clusters , 2002, Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing.

[7]  Paul Douglas,et al.  International Conference on Information Technology : Coding and Computing , 2003 .

[8]  Peter G. Neumann,et al.  EMERALD: Event Monitoring Enabling Responses to Anomalous Live Disturbances , 1997, CCS 2002.

[9]  Zhen Liu,et al.  A comparison of input representations in neural networks: a case study in intrusion detection , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[10]  Bronis R. de Supinski,et al.  Dynamic Software Testing of MPI Applications with Umpire , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[11]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[12]  Timothy W. Curry,et al.  Profiling and Tracing Dynamic Library Usage Via Interposition , 1994, USENIX Summer.

[13]  Thomas E. Anderson,et al.  SLIC: An Extensibility System for Commodity Operating Systems , 1998, USENIX Annual Technical Conference.