Impact of Hard and Soft Failures on Availability of Enterprise Storage Systems

Enterprise storage systems are an integral part of any high-end computing system. Storage availability is essential for delivering requested data to servers in a timely manner. Any service disruption, whether caused by data loss or by data unavailability, can be very costly for enterprise applications such as e-banking or e-shopping. Component failure is one of the major causes of both data loss and data unavailability. Although data loss can be prevented by data protection techniques such as remote mirroring, snapshots, and backups, these techniques are less effective at preventing data unavailability. In this work, we investigate the impact of hard and soft failures of processor cores on the overall availability of enterprise storage systems. We use an analytical technique based on a Markov model to estimate the number and duration of system downtime events.
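To make the Markov-model approach concrete, the following is a minimal sketch of how downtime figures can be derived from a continuous-time Markov chain. It is not the model used in this work: the abstract does not specify the state space or the rates, so the three-state structure (up, down after a soft failure, down after a hard failure), the function names, and all numeric rates below are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760.0


def steady_state(lam_soft, lam_hard, mu_soft, mu_hard):
    """Steady-state probabilities (up, soft_down, hard_down).

    Assumed (hypothetical) chain: the system leaves the 'up' state at
    rate lam_soft or lam_hard (failures per hour) and returns to 'up'
    at rate mu_soft or mu_hard (recoveries per hour).
    """
    # Balance equations: pi_up * lam_x = pi_x * mu_x for each down state x.
    r_soft = lam_soft / mu_soft
    r_hard = lam_hard / mu_hard
    # Normalization: pi_up + pi_soft + pi_hard = 1.
    pi_up = 1.0 / (1.0 + r_soft + r_hard)
    return pi_up, pi_up * r_soft, pi_up * r_hard


def downtime_summary(lam_soft, lam_hard, mu_soft, mu_hard):
    """Availability, expected downtime events per year, and downtime hours per year."""
    pi_up, pi_soft, pi_hard = steady_state(lam_soft, lam_hard, mu_soft, mu_hard)
    return {
        "availability": pi_up,
        # Frequency of leaving 'up' = P(up) * total failure rate.
        "events_per_year": pi_up * (lam_soft + lam_hard) * HOURS_PER_YEAR,
        "downtime_hours_per_year": (pi_soft + pi_hard) * HOURS_PER_YEAR,
    }


if __name__ == "__main__":
    # Illustrative rates only, not measured values:
    # one soft failure per 1,000 h with a 6-minute recovery,
    # one hard failure per 10,000 h with a 10-hour repair.
    summary = downtime_summary(
        lam_soft=1 / 1000, lam_hard=1 / 10000, mu_soft=1 / 0.1, mu_hard=1 / 10
    )
    for name, value in summary.items():
        print(f"{name}: {value:.6f}")
```

In this sketch soft failures are frequent but short-lived, while hard failures are rare but slow to repair, so the two failure classes can contribute very differently to the number of downtime events and to the total downtime.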
