An Artificial Intelligence Approach to Network Fault Management ‡

Traditionally, network management activities, such as fault management, have been performed with direct human involvement. However, these activities are becoming more demanding and data-intensive because of the heterogeneous nature and increasing size of today’s networks. For these reasons, it is becoming necessary to automate network management activities. Artificial intelligence technologies can play an important role in the problem-solving and reasoning techniques employed in fault management. Expert systems have been successfully applied to some types of fault management, but they are not flexible enough for today’s evolving network needs. We propose a hybrid AI solution that employs both neural networks and case-based reasoning techniques for the fault management of heterogeneous distributed information networks.

Overview of Fault Management Activities

Today’s high-speed, heterogeneous networks represent a complex and data-intensive environment that requires different solutions from the traditional methods performed by human operators. Automation of network management activities, including fault management, performance analysis, and traffic management, can benefit from the use of artificial intelligence (AI) technologies. Here we focus on fault management, where the goal is to proactively diagnose the cause of abnormal network behavior and to propose and, if possible, take corrective actions. First, an overview of fault management activities and responsibilities is given. A discussion then follows on how AI technologies, in particular neural networks (NNs) and case-based reasoning (CBR), may be used to automate the fault management process. The discussion assumes a telecommunications synchronous optical network (SONET) with asynchronous transfer mode (ATM) switching. However, the same model can be applied to any heterogeneous distributed information network.
Network faults can be classified into hardware faults and software faults. Either kind causes elements to produce incorrect outputs, which in turn can cause network-wide failure effects such as congestion (Wang, 1989). Examples of hardware faults are element failures caused by a flaw or weakness in the logical design, and malfunctions caused by simple wear and tear or by external forces such as accidents, acts of nature, mishandling, or improper installation. Examples of software faults include failure of elements due to incorrect or incomplete design of their software, erratic behavior of elements or the network due to software bugs (e.g., incorrect packet header processing), and slow or faulty service by the network due to incorrect information (e.g., incorrect routing tables).

The flow of fault management, shown in Figure 1, can be described as follows: (1) collect alarms; (2) maintain customer satisfaction through immediate action; (3) filter and correlate the alarms; (4) diagnose faults through analysis and testing; (5) determine a plan for correction, display correction options to users, and implement the correction plan; (6) verify that the fault is eliminated; (7) record data and determine the effectiveness of the current fault management function.

The first step in fault management is to collect monitoring and performance alarms. Typically, alarms are produced either by managed network elements (e.g., ATM switches, customer premises equipment) or by a statistical analysis of the network that monitors trends and threshold crossings. Alarms fall into two categories, physical and logical. Physical alarms are hard errors (e.g., a link is down), typically reported through an element manager. Logical alarms are statistical errors (e.g., performance degradation due to congestion). Once the alarms have been reported and collected, adequate service must be maintained through immediate action.
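The two alarm categories and the idea of a threshold crossing can be made concrete with a small sketch. This is only an illustration under stated assumptions: the `Alarm` record, the `threshold_alarms` helper, the element name `atm-switch-3`, and the utilization figures are all hypothetical and do not come from any element manager interface described in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmCategory(Enum):
    PHYSICAL = "physical"  # hard error, e.g. a link is down
    LOGICAL = "logical"    # statistical error, e.g. congestion


@dataclass(frozen=True)
class Alarm:
    source: str              # hypothetical network element identifier
    category: AlarmCategory
    description: str


def threshold_alarms(source, metric, samples, threshold):
    """Emit one logical alarm for each upward crossing of the threshold."""
    alarms = []
    previous_above = False
    for value in samples:
        above = value > threshold
        if above and not previous_above:
            alarms.append(
                Alarm(source, AlarmCategory.LOGICAL, f"{metric} crossed {threshold}")
            )
        previous_above = above
    return alarms


# A physical alarm, as an element manager might report it (assumed format).
link_down = Alarm("atm-switch-3", AlarmCategory.PHYSICAL, "link down")

# Made-up utilization samples; the 0.9 threshold is crossed upward twice.
utilization = [0.41, 0.78, 0.93, 0.97, 0.62, 0.95]
collected = [link_down] + threshold_alarms(
    "atm-switch-3", "utilization", utilization, 0.9
)
```

Counting only upward crossings, rather than every sample above the threshold, keeps a sustained overload from flooding the collector with repeats of what is effectively one event.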
The immediate action taken after alarm collection serves as a temporary stopgap while fault diagnosis proceeds, ensuring that the customer does not experience a loss or decrease in service. An example is routing traffic in the opposite direction when a SONET ring breaks or, when a switch malfunctions, rerouting traffic around the problem area.

‡ This research was partly supported by Sprint Corporation and partly by internal funding at SRI International. The opinions expressed in this publication do not necessarily reflect a position or policy of Sprint.

After customer satisfaction is ensured, the next step is to filter and correlate the alarms. Alarm filtering analyzes the multitude of alarms received and eliminates the redundant ones (e.g., multiple occurrences of the same alarm). Alarm correlation interprets multiple alarms together so that new conceptual meanings can be assigned to them, creating derived alarms.

Faults are identified by analyzing the filtered and correlated alarms and by requesting tests and status updates from the element managers, which provide additional information for diagnosis. Once a fault has been diagnosed, corrective procedures may be undertaken by the network to eliminate its cause. The fault management system’s role in correction is to develop a plan or series of actions and to initiate this plan with other functions within the network. As much of the correction as possible is performed automatically, without human intervention, although at times a technician must physically go to a site to replace a part, or a programmer must debug some software. The correction must be verified through testing requests sent to the element managers; if the fault does not disappear, more data is analyzed and the diagnostic process is repeated.
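Filtering and correlation, as described in this section, can be sketched in a few lines. The sketch is a deliberately simple stand-in that uses exact-match deduplication and one hand-written rule; it is not the neural network approach the paper proposes. The alarm tuples, the rule in `RULES`, and the derived alarm name `suspected_link_failure` are all hypothetical.

```python
from collections import defaultdict


def filter_alarms(alarms):
    """Drop repeated occurrences of the same alarm, keeping first-seen order."""
    seen = set()
    unique = []
    for alarm in alarms:
        if alarm not in seen:
            seen.add(alarm)
            unique.append(alarm)
    return unique


def correlate_alarms(alarms, rules):
    """Return a derived alarm for each element whose alarms satisfy a rule."""
    kinds_by_element = defaultdict(set)
    for element, kind in alarms:
        kinds_by_element[element].add(kind)
    derived = []
    for element, kinds in kinds_by_element.items():
        for required_kinds, meaning in rules:
            if required_kinds <= kinds:  # every required alarm kind is present
                derived.append((element, meaning))
    return derived


# Hypothetical rule: these two alarms on one element suggest a link failure.
RULES = [
    (frozenset({"loss_of_signal", "high_cell_loss"}), "suspected_link_failure"),
]

# Each alarm is an (element, kind) pair; the duplicate is intentional.
raw = [
    ("switch-a", "loss_of_signal"),
    ("switch-a", "loss_of_signal"),
    ("switch-a", "high_cell_loss"),
    ("switch-b", "high_cell_loss"),
]
filtered = filter_alarms(raw)
derived = correlate_alarms(filtered, RULES)
```

Here `filtered` holds three alarms and `derived` holds a single derived alarm for `switch-a`. A learned correlator would replace the fixed rule table, but the input and output shapes would stay much the same.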
Another step in fault management is to collect data about the effectiveness of the fault management process, in order to monitor the damage caused by faults and the costs of repair. As outlined by Byrne (1994), questions regarding how often faults occur and how many faults affect service should be normalized to account for network size and number of customers. Other questions, regarding the length of service interruptions, the number of times a fault is correctly identified, and the number of hours to repair, should be normalized according to the number of relevant faults detected. These statistics can be used to analyze the performance of the fault management system, and they can feed other network management activities such as capacity planning, where they help determine current and projected costs for maintaining the network. In addition, a finer-grained analysis of the types of faults can shed some light on the reliability of different types of equipment.

Figure 1: The fault management process and possible AI technologies. [Flowchart boxes: NETWORK; COLLECT ALARMS; TAKE ACTION to ENSURE CUSTOMER SATISFACTION; FILTER and CORRELATE ALARMS (Neural Networks or Bayesian Belief Networks); DIAGNOSE FAULTS (Case-based Reasoning or Expert Systems); DEVELOP and IMPLEMENT CORRECTIVE PLAN (Case-based Reasoning, Intelligent Planning, or Expert Systems); VERIFY FAULT IS ELIMINATED; RECORD EVENTS and ANALYZE FAULT MANAGEMENT PERFORMANCE. Input labels: PHYSICAL ALARMS, LOGICAL ALARMS. Branch labels: Fault not Eliminated, Fault is Eliminated.]
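The normalized statistics discussed in this section can be illustrated with a short calculation. The record fields, the per-1000-customers scale, and the sample values below are assumptions made for the example; the source does not prescribe specific formulas, and here every per-fault average is taken over all detected faults.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    service_affecting: bool
    correctly_identified: bool
    outage_hours: float
    repair_hours: float


def summarize(records, customer_count):
    """Normalize fault statistics by customer count and by faults detected."""
    fault_count = len(records)
    if fault_count == 0 or customer_count <= 0:
        raise ValueError("need at least one fault record and one customer")
    affecting = sum(1 for r in records if r.service_affecting)
    correct = sum(1 for r in records if r.correctly_identified)
    return {
        # Frequency questions: normalized by number of customers.
        "faults_per_1000_customers": 1000 * fault_count / customer_count,
        "service_affecting_per_1000_customers": 1000 * affecting / customer_count,
        # Quality questions: normalized by number of faults detected.
        "correct_identification_rate": correct / fault_count,
        "mean_outage_hours": sum(r.outage_hours for r in records) / fault_count,
        "mean_repair_hours": sum(r.repair_hours for r in records) / fault_count,
    }


# Made-up records for a network serving 2000 customers.
records = [
    FaultRecord(True, True, 1.0, 3.0),
    FaultRecord(False, True, 0.0, 1.0),
    FaultRecord(True, False, 2.0, 5.0),
    FaultRecord(False, True, 0.0, 3.0),
]
stats = summarize(records, customer_count=2000)
```

Normalizing in this way lets figures from networks of different sizes, or from the same network at different times, be compared directly.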
