Scalable and efficient distributed self-healing with self-optimization features in fixed IP networks

The Internet is continuously gaining importance in our society. It is gradually turning into the backbone of the modern world, with an impact on many aspects of life, such as politics, communication, intercultural exchange, and emergency services. As these aspects develop, the technical infrastructure around the Internet’s core protocol, IP (Internet Protocol), is increasingly exposed to various challenges. One of these challenges is the requirement for sophisticated resilience mechanisms that can guarantee the robustness of the IP infrastructure in case of faults, failures, and natural disasters. This is of paramount importance for two reasons. First, Internet-type networks are to be deployed in the area of safety-critical systems, such as emergency services (e.g. VoIP emergency services) that will be used in the course of network recovery after a disaster event (e.g. flooding or acts of terrorism). Second, the more robust the infrastructure, the higher the availability of its services and, correspondingly, the revenue of the service provider. Thus, resilience and robustness are also desirable features beyond the above-mentioned safety use case. This dissertation aims to develop a new architectural framework for improving the resilience of network nodes in fixed IP network infrastructures, i.e. IP networks without mobility or a continuously changing physical topology. The thesis approaches the topic of resilience from two perspectives. First, it recognizes that self-healing mechanisms are already embedded in diverse network protocols, as well as in applications and services running on top of a fixed IP network. Second, it analyzes the importance of network and systems management processes for the availability of the network and IT infrastructure.
This leads to the identification of a gap between the resilience features that are intrinsically embedded in the protocols and applications, on the one hand, and the network and systems management processes, on the other. The gap consists in the lack of a framework that runs on top of the protocols and applications and manages them with respect to incidents, thereby automating aspects of the established management standards. In addition, this framework is meant to serve as a layer between the network or system administrator and the networked infrastructure. That is, on the one hand, the framework is configured and provided with knowledge by the human experts who tweak and improve the system. On the other hand, the framework is designed to escalate faulty conditions that it cannot resolve to the operations personnel, so that responsive managerial actions can be initiated. The architectural framework consists of software components that operate in a distributed manner inside the nodes of the networked system in question. These software components respond to faulty conditions both proactively and reactively: failures are predicted and avoided, and an automatic response to already existing faulty conditions is realized. Correspondingly, existing mechanisms for realizing these processes are evaluated, and where required, new algorithms are developed for the proposed framework, for instance scalable Markov-chain-based fault isolation and efficient self-optimization and action synchronization.
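
The division of labor described above (expert-supplied knowledge, automatic proactive and reactive response, and escalation of unresolved conditions to the operations personnel) can be illustrated with a minimal sketch. It is not the framework developed in the dissertation: the names `Incident`, `HealingAgent`, `remediations`, and `escalated`, as well as the fault labels used in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    """A faulty condition observed or predicted on a node (hypothetical model)."""
    node: str
    fault: str
    predicted: bool = False  # True if the failure is predicted but has not occurred yet


@dataclass
class HealingAgent:
    """Node-local component: tries a known remediation, otherwise escalates."""
    # Knowledge supplied by human experts: fault type -> action that returns True on success.
    remediations: Dict[str, Callable[[Incident], bool]] = field(default_factory=dict)
    # Incidents handed over to the operations personnel.
    escalated: List[Incident] = field(default_factory=list)

    def handle(self, incident: Incident) -> str:
        action = self.remediations.get(incident.fault)
        if action is not None and action(incident):
            # Proactive case: the failure was avoided; reactive case: it was healed.
            return "avoided" if incident.predicted else "healed"
        # No known remediation, or the remediation failed: escalate to humans.
        self.escalated.append(incident)
        return "escalated"


# Example with made-up fault labels and trivial stand-in remediation actions.
agent = HealingAgent(remediations={
    "link_flap": lambda inc: True,   # stand-in for a remediation that succeeds
    "disk_full": lambda inc: False,  # stand-in for a remediation that fails
})
print(agent.handle(Incident("r1", "link_flap")))
print(agent.handle(Incident("r1", "link_flap", predicted=True)))
print(agent.handle(Incident("r2", "disk_full")))
print(agent.handle(Incident("r3", "unknown_fault")))
```

In this sketch the dictionary of remediations stands in for the knowledge provided by human experts, and the `escalated` list stands in for the interface towards the operations personnel.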
