Software fault tolerance in telecommunications systems

1. Introduction

The telephone system in many countries, including the United States, depends on stored program control switching systems to establish calls between millions of users. The programs for these systems are large and expensive to develop and maintain. The size and complexity of the programs arise from various requirements [1]:

• high functionality
• strict real-time constraints (30 msec response time to a request)
• high traffic levels (100-200 call attempts/sec)
• a variety of hardware devices to be controlled
• very high availability (down 2 hours in 40 years) and reliability (2 lost calls in 10,000) requirements
• ability to run unattended for long periods.

Complexity will increase rapidly with the network trend toward a more standard and open software environment, supporting:

• more complex interactions among network elements such as switching nodes which perform call processing, signaling transfer points which allow switching nodes to communicate, and service control points which control data accessed by switching nodes
• more equipment suppliers, leading to greater system heterogeneity
• greater complexity within switching nodes, due to open interfaces and multiple application developers within a single system (instead of just one or two as in the past)
• highly distributed intelligence and rapid introduction of new features and services.

Reliability has been a major issue in the design of telecommunications systems. Hardware fault tolerance is well understood, and arbitrarily small hardware error probabilities can be obtained by combining techniques such as modular fail-stop design, redundancy, automatic reconfiguration and

[1]  Richard J. Lipton, et al. New Directions in Testing, 1989, Distributed Computing and Cryptography.

[2]  Jim Gray, et al. Why Do Computers Stop and What Can Be Done About It?, 1986, Symposium on Reliability in Distributed Software and Database Systems.

[3]  Algirdas Avizienis, et al. The N-Version Approach to Fault-Tolerant Software, 1985, IEEE Transactions on Software Engineering.

[4]  Syed R. Ali. Analysis of Total Outage Data for Stored Program Control Switching Systems, 1986, IEEE J. Sel. Areas Commun.

[5]  Brian Randell. System Structure for Software Fault Tolerance, 1975, IEEE Trans. Software Eng.

[6]  G. Herman, et al. The feature interaction problem in telecommunications systems, 1989.

[7]  Joe Armstrong, et al. ERLANG - an experimental telephony programming language, 1990, International Symposium on Switching.