Issues with and approaches to network monitoring and problem remediation in military tactical networks

This paper describes an approach to root cause analysis and fault correlation that addresses the problems inherent in wireless military networks. Root cause analysis is concerned with identifying and correcting problems in a network. Its goal is to diagnose the cause of network anomalies so that adequate communication functionality is maintained to support the requirements of the network users. In a wired network, fault diagnosis is easier because the topology and the links are fixed. In the wireless networks at the tactical edge of military networks, there is no hardwired connectivity, yet end users still have expectations of the network that constrain its operation. Fault diagnosis in such networks is fundamentally different from that in wired networks. Network performance must be managed explicitly with respect to user expectations, even though the network connectivity is dynamic, the network monitoring traffic must traverse the (possibly failing) network itself, and the “correct” behavior against which the current network state is compared evolves over time. The novel features of our solution that distinguish it from existing root cause analysis techniques are (a) a dynamic model of fault, performance and security problem propagation in the network that can evolve as the definition of network correctness changes, (b) a method for distributing reasoning over this model throughout the network into independent Correlators that share information through a set of Clearinghouses to provide a global root cause correlation capability, and (c) the ability of the Correlator and Clearinghouse reasoning to adapt gracefully when network problems prevent the full exchange of information required for root cause analysis.
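To make features (a) to (c) concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes the propagation model can be reduced to a mapping from candidate root causes to the symptoms they are expected to produce, and it scores each cause by the fraction of its expected symptoms that were observed. The names `MODEL`, `Correlator`, `Clearinghouse` and `diagnose`, the symptom strings, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical propagation model: root cause -> symptoms it is expected to cause.
# It is a plain dict so it can be replaced as the notion of "correct" changes.
MODEL = {
    "link_jamming": {"high_loss:n1-n2", "low_snr:n1-n2"},
    "node_failure:n2": {"high_loss:n1-n2", "no_heartbeat:n2"},
}


@dataclass
class Correlator:
    """Reasons over the model using locally observed symptoms."""
    name: str
    model: dict
    symptoms: set = field(default_factory=set)

    def observe(self, symptom):
        self.symptoms.add(symptom)

    def rank(self, extra=frozenset()):
        # Score = fraction of a cause's expected symptoms that were seen.
        seen = self.symptoms | set(extra)
        scores = {
            cause: len(expected & seen) / len(expected)
            for cause, expected in self.model.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


@dataclass
class Clearinghouse:
    """Holds symptom reports published by Correlators."""
    reports: dict = field(default_factory=dict)

    def publish(self, correlator):
        self.reports[correlator.name] = set(correlator.symptoms)

    def shared_symptoms(self, exclude):
        shared = set()
        for name, symptoms in self.reports.items():
            if name != exclude:
                shared |= symptoms
        return shared


def diagnose(correlator, clearinghouse, reachable):
    """Use shared evidence when the Clearinghouse is reachable,
    otherwise fall back to local evidence only."""
    if reachable:
        clearinghouse.publish(correlator)
        extra = clearinghouse.shared_symptoms(exclude=correlator.name)
        return correlator.rank(extra), "global"
    return correlator.rank(), "local-only"


c1 = Correlator("c1", MODEL)
c2 = Correlator("c2", MODEL)
c1.observe("high_loss:n1-n2")
c2.observe("no_heartbeat:n2")

hub = Clearinghouse()
hub.publish(c2)

global_view = diagnose(c1, hub, reachable=True)
local_view = diagnose(c1, hub, reachable=False)
```

In this toy run, the first Correlator cannot separate the two candidate causes from its own evidence, while the symptom shared through the Clearinghouse favors the node failure. The returned scope label ("global" or "local-only") is one simple way to signal that a diagnosis was produced without the full exchange of information.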
