Lightweight monitoring of MPI programs in real time

Current technologies allow efficient data collection by several sensors to determine an overall evaluation of the status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we perform a comparison of different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of both library‐function and operating‐system calls of parallel programs written with C and MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a false‐positive rate near 0% in real time, and we show that the added computational load on each node is small. As an example, the monitoring of function calls using a hidden Markov model generates less than 5% overhead. The proposed system is able to automatically detect deviations of a process from its expected behavior in any node of the cluster, and thus it can be used as an anomaly detector, for performance monitoring to complement other systems or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.

[1]  Roberto Baldoni,et al.  CORBA request portable interceptors: analysis and applications , 2003, Concurr. Comput. Pract. Exp..

[2]  Terran Lane,et al.  Hidden Markov Models for Human/Computer Interface Modeling , 1999 .

[3]  Zhen Liu,et al.  Attacking a High Performance Computer Cluster , 2004 .

[4]  Stephanie Forrest,et al.  A sense of self for Unix processes , 1996, Proceedings 1996 IEEE Symposium on Security and Privacy.

[5]  Zhen Liu,et al.  A comparison of input representations in neural networks: a case study in intrusion detection , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[6]  Zhen Liu,et al.  Classification of anomalous traces of privileged and parallel programs by neural networks , 2003, The 12th IEEE International Conference on Fuzzy Systems, 2003. FUZZ '03..

[7]  R. Sekar,et al.  User-Level Infrastructure for System Call Interposition: A Platform for Intrusion Detection and Confinement , 2000, NDSS.

[8]  T. Mitchem,et al.  Using kernel hypervisors to secure applications , 1997, Proceedings 13th Annual Computer Security Applications Conference.

[9]  Philippe Augerat,et al.  Scalable monitoring and configuration tools for grids and clusters , 2002, Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing.

[10]  Rajkumar Buyya,et al.  High Performance Cluster Computing , 1999 .

[11]  Anthony Skjellum,et al.  A Multithreaded Message Passing Interface (MPI) Architecture: Performance and Program Issues , 2001, J. Parallel Distributed Comput..

[12]  Anthony Skjellum,et al.  A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..

[13]  Michael R. Lyu,et al.  Software fault tolerance in a clustered architecture: techniques and reliability modeling , 1999, 1999 IEEE Aerospace Conference. Proceedings (Cat. No.99TH8403).

[14]  Stephanie Forrest,et al.  Operating system stability and security through process homeostasis , 2002 .

[15]  Timothy Fraser,et al.  Hardening COTS software with generic software wrappers , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[16]  James Coyle,et al.  Deadlock detection in MPI programs , 2002, Concurr. Comput. Pract. Exp..

[17]  Bronis R. de Supinski,et al.  Dynamic Software Testing of MPI Applications with Umpire , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[18]  Ian Goldberg,et al.  A Secure Environment for Untrusted Helper Applications ( Confining the Wily Hacker ) , 1996 .

[19]  Peter G. Neumann,et al.  EMERALD: Event Monitoring Enabling Responses to Anomalous Live Disturbances , 1997, CCS 2002.

[20]  Eugene H. Spafford,et al.  Generation of Application Level Audit Data via Library Interposition , 1998 .

[21]  Stephen Smalley,et al.  Integrating Flexible Support for Security Policies into the Linux Operating System , 2001, USENIX Annual Technical Conference, FREENIX Track.

[22]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..

[23]  Thomas E. Anderson,et al.  SLIC: An Extensibility System for Commodity Operating Systems , 1998, USENIX ATC.

[24]  Moreno Marzolla,et al.  A performance monitoring system for large computing clusters , 2003, Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings..

[25]  Anthony Skjellum,et al.  Shared-memory communication approaches for an MPI message-passing library , 2000 .

[26]  Yu Lin,et al.  Application intrusion detection using language library calls , 2001, Seventeenth Annual Computer Security Applications Conference.

[27]  Nigel Edwards,et al.  A Secure Linux Platform , 2001, Annual Linux Showcase & Conference.

[28]  Michael Schatz,et al.  Learning Program Behavior Profiles for Intrusion Detection , 1999, Workshop on Intrusion Detection and Network Monitoring.

[29]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[30]  Timothy W. Curry,et al.  Profiling and Tracing Dynamic Library Usage Via Interposition , 1994, USENIX Summer.

[31]  Calton Pu,et al.  SubDomain: Parsimonious Server Security , 2000, LISA.

[32]  Anthony Skjellum,et al.  Designing shared-memory communication devices for MPI message-passing library * , 1998 .

[33]  David Walker,et al.  A type system for expressive security policies , 2000, POPL '00.

[34]  Barak A. Pearlmutter,et al.  Detecting intrusions using system calls: alternative data models , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[35]  Stephanie Forrest,et al.  Intrusion Detection Using Sequences of System Calls , 1998, J. Comput. Secur..

[36]  Karl N. Levitt,et al.  Automated detection of vulnerabilities in privileged programs by execution monitoring , 1994, Tenth Annual Computer Security Applications Conference.