Zen and the Art of Railway Maintenance: Analysis and Optimization of Maintenance via Fault Trees and Statistical Model Checking

Maintenance is crucial for the operation of modern systems. Timely inspections, repairs, and replacements help to prevent costly failures and downtime, and ensure that systems continue to function properly and safely. At the same time, this maintenance is costly. It requires staff, spare parts, and often downtime while inspections or repairs are being performed. Too much maintenance wastes money, reduces the overall usefulness of the system, and even risks accidents due to improper maintenance. It is therefore important to find a good maintenance policy that balances cost and dependability. To achieve this balance, one must understand how a system wears out over time, and what effects various actions to remove or prevent this wear have.

This thesis presents fault maintenance trees (FMTs), a novel formalism for the quantitative analysis of the effects of maintenance on costs and system dependability, to support the analysis and improvement of maintenance policies. FMTs are based on the industry-standard formalism of fault trees (FTs), which have been used since the 1960s to study the reliability of safety-critical systems such as nuclear power plants and airplanes. A wide range of extensions and variants has been developed since then. These support the analysis of systems with time-dependent failures, uncertainty of failure probabilities, and various other properties. The first part of this thesis provides an overview of the jungle of fault tree extensions, surveying over 150 papers on the topic.

The second part of this thesis introduces FMTs, which augment fault trees by including maintenance actions such as inspections and component replacements. With this information, we can calculate the probability of a system failure given a specific maintenance plan. FMTs also include information about the costs of different maintenance actions and failures, allowing one to calculate the expected total costs for a given policy. Thus, FMTs allow the comparison of different maintenance policies with respect to their effects on system reliability and cost, supporting the choice of the policy that best balances the two.

Technically, FMTs are analysed using statistical model checking (SMC), a state-of-the-art technique to analyse complex systems without the excessive memory requirements of many other analysis techniques for extended FTs. SMC allows us to compute statistically justified confidence intervals on quantitative metrics such as cost, system reliability, and expected number of failures over time. SMC works well for many systems, but has a drawback that is particularly noticeable in our setting: accurate estimates of low probabilities can take a long time to compute. We therefore provide a second analysis technique based on the recently developed Path-ZVA algorithm for rare event simulation. While this technique is currently limited to computing the average system availability, it requires much less computation time than SMC does for high-availability systems, without losing the statistical guarantees that SMC provides.

Finally, we want FMTs to be applicable in a practical setting. To this end, the third part of this thesis presents two case studies from the railway industry: an electrically insulated railway joint, and a pneumatic compressor. These case studies were performed in close collaboration with our industrial partners, and demonstrate that FMTs can accurately model real-life systems and maintenance policies, and provide insights to help improve maintenance plans.
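
To give a flavour of the simulation-based estimation that underlies SMC, the following is a minimal Monte Carlo sketch. It is not the FMT formalism or the tooling used in this thesis. It assumes a single hypothetical component with a Weibull-distributed lifetime that is renewed at fixed intervals, and it estimates the probability of failure within a time horizon together with a normal-approximation 95% confidence interval. All names and parameter values (`simulate_once`, `estimate_failure_probability`, the shape and scale) are illustrative assumptions.

```python
import math
import random


def simulate_once(rng, horizon, replace_interval, shape=2.0, scale=10.0):
    """Return True if the component fails before `horizon`.

    Assumption: the lifetime is Weibull(scale, shape), and the component
    is replaced by a new one every `replace_interval` time units
    (None means no preventive replacement at all).
    """
    t = 0.0
    while t < horizon:
        # Fresh lifetime for the component installed at time t.
        lifetime = rng.weibullvariate(scale, shape)
        if replace_interval is None:
            next_replacement = horizon
        else:
            next_replacement = min(t + replace_interval, horizon)
        if t + lifetime < next_replacement:
            return True  # failed before it could be replaced
        t = next_replacement
    return False


def estimate_failure_probability(n_runs, horizon, replace_interval, seed=0):
    """Estimate P(failure before horizon) with an approximate 95% interval."""
    rng = random.Random(seed)
    failures = sum(
        simulate_once(rng, horizon, replace_interval) for _ in range(n_runs)
    )
    p_hat = failures / n_runs
    # Normal-approximation half-width; crude when p_hat is close to 0 or 1.
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n_runs)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    for interval in (None, 2.0):
        p, lo, hi = estimate_failure_probability(20_000, 10.0, interval)
        print(f"replace every {interval}: p = {p:.3f}  [{lo:.3f}, {hi:.3f}]")
```

The width of the interval shrinks only with the square root of the number of runs, and the relative error grows as the estimated probability gets smaller. This is the effect that makes rare events expensive to estimate by plain simulation.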