Recovery scopes, recovery groups, and fine-grained recovery in enterprise storage controllers with multi-core processors

In this paper we extend a previously published approach to error recovery in enterprise storage controllers with multi-core processors. Our approach first partitions the tasks in the runtime of the controller software into clusters of dependent tasks (recovery scopes). These recovery scopes are then mapped onto a set of recovery groups, which form the basis for scheduling tasks during both recovery and normal operation. This recovery-aware scheduling (RAS) replaces the performance-based scheduling of the storage controller. Through simulation and benchmark experiments, we find that (1) the performance of RAS appears to depend critically on the values of recovery-related parameters, and (2) our fine-grained recovery approach promises to enhance storage system availability while keeping the additional overhead, and the resulting performance degradation, under control.
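The two-step structure described above (cluster dependent tasks into recovery scopes, then map scopes onto recovery groups) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not say how dependencies are determined or how scopes are assigned to groups. The sketch assumes that dependencies are given as pairs of task names, that a recovery scope is a connected component of the resulting dependency graph, and that scopes are assigned greedily to the least-loaded group. The function names `recovery_scopes` and `map_to_groups` and the task names are hypothetical.

```python
from collections import defaultdict


def recovery_scopes(tasks, dependencies):
    """Cluster tasks into recovery scopes.

    Assumption: a scope is a connected component of the task
    dependency graph, computed here with union-find.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Follow parent links to the root, halving the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in dependencies:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters

    scopes = defaultdict(list)
    for t in tasks:
        scopes[find(t)].append(t)
    # Sort for deterministic output.
    return sorted(sorted(s) for s in scopes.values())


def map_to_groups(scopes, num_groups):
    """Map recovery scopes onto recovery groups.

    Assumption: a greedy policy that places the largest scope first,
    always into the group that currently holds the fewest tasks.
    """
    groups = [[] for _ in range(num_groups)]
    load = [0] * num_groups
    for scope in sorted(scopes, key=len, reverse=True):
        i = load.index(min(load))  # least-loaded group so far
        groups[i].append(scope)
        load[i] += len(scope)
    return groups


# Hypothetical controller tasks and dependencies, for illustration only.
tasks = ["cache_flush", "io_read", "io_write", "lock_mgr", "stats"]
deps = [("io_read", "lock_mgr"), ("io_write", "lock_mgr")]

scopes = recovery_scopes(tasks, deps)
groups = map_to_groups(scopes, num_groups=2)
print(scopes)
print(groups)
```

In this toy input, the two I/O tasks and the lock manager end up in one scope because they are transitively dependent, while the independent tasks form singleton scopes. A scheduler built on such groups could then confine recovery to the group containing the failed task rather than restarting the whole controller, which is the intuition behind fine-grained recovery.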
