A Recovery Conscious Framework for Fault Resilient Storage Systems

This paper presents a recovery-conscious framework for improving the fault resiliency and recovery efficiency of highly concurrent embedded storage software systems. Our framework consists of a three-tier architecture and a suite of recovery-conscious techniques. At the top tier, we promote fine-grained recovery at the task level by introducing recovery scopes to model recovery dependencies between tasks. At the middle tier, we develop highly effective groupings of recovery scopes into recovery groups based on system and workload characteristics. We study how to distribute recovery scopes among recovery groups and how to schedule recovery groups effectively in a multi-core storage system through careful tuning of recovery-efficiency-sensitive parameters. At the bottom tier, we advocate the use of recovery-conscious scheduling instead of performance-oriented scheduling to provide high recovery efficiency without sacrificing system performance. An important question to address at this tier is under which combinations of resource pools and recovery groups recovery-conscious scheduling outperforms performance-oriented scheduling. Our techniques have been implemented on a real industry-standard storage system. Experimental results show that the right choice of recovery-sensitive parameters is critical, and that our techniques are effective and non-intrusive and can significantly boost system resilience while delivering high performance under a variety of system configurations.
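As a rough illustration of how the three tiers might fit together, the sketch below tags tasks with a recovery scope, maps scopes to recovery groups, and dispatches each task only onto cores in its group's resource pool. It is not the paper's implementation: every name (`Task`, `group_scopes`, `assign_pools`, `schedule`) is hypothetical, the round-robin grouping is only a placeholder for the workload-aware grouping the paper studies, and the static one-pool-per-group binding is an assumption about what "combinations of resource pools and recovery groups" could look like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    scope: str  # hypothetical recovery-scope label (top tier)


def group_scopes(scopes, num_groups):
    """Middle tier (assumed policy): round-robin scopes into recovery groups."""
    return {scope: i % num_groups for i, scope in enumerate(sorted(scopes))}


def assign_pools(num_cores, num_groups):
    """Assumed binding: statically partition cores, one pool per recovery group."""
    if num_cores < num_groups:
        raise ValueError("need at least one core per recovery group")
    pools = defaultdict(list)
    for core in range(num_cores):
        pools[core % num_groups].append(core)
    return dict(pools)


def schedule(tasks, scope_to_group, pools):
    """Bottom tier (sketch): place each task on a core from its group's pool."""
    next_slot = defaultdict(int)
    placement = {}
    for task in tasks:
        group = scope_to_group[task.scope]
        pool = pools[group]
        placement[task.name] = pool[next_slot[group] % len(pool)]
        next_slot[group] += 1
    return placement


tasks = [Task("t1", "cache"), Task("t2", "cache"), Task("t3", "io")]
scope_to_group = group_scopes({t.scope for t in tasks}, num_groups=2)
pools = assign_pools(num_cores=4, num_groups=2)
placement = schedule(tasks, scope_to_group, pools)
```

Under these assumptions, tasks from the `cache` scope can only occupy the cores in their own pool, so a failure and recovery in that scope leaves the other pool's cores free to keep serving the `io` scope. A performance-oriented scheduler would instead place any task on any idle core.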
