SWEEPER: An Efficient Disaster Recovery Point Identification Mechanism

Data corruption is a top concern for most CIOs. Continuous Data Protection (CDP) technologies help enterprises deal with data corruption by maintaining multiple versions of data and allowing an administrator to restore an earlier clean version. The aim of the recovery process after data corruption is to quickly traverse the backup copies (old versions) and retrieve a clean copy of the data. Currently, data recovery is an ad hoc, time-consuming, and frustrating process that relies on sequential brute-force approaches, where recovery time is proportional to the number of backup copies examined and the time needed to check each backup copy for corruption. In this paper, we present the design and implementation of the SWEEPER architecture and backup copy selection algorithms that specifically tackle the problem of quickly and systematically identifying a good recovery point. We monitor various system events and generate checkpoint records that help in quickly identifying a clean backup copy. The SWEEPER methodology dynamically determines the selection algorithm based on user-specified recovery time and recovery point objectives, and thus allows system administrators to trade off recovery time against data currentness. We have implemented our solution as part of a popular Storage Resource Manager product and evaluated SWEEPER under many diverse settings. Our study establishes the effectiveness of SWEEPER as a robust strategy for significantly reducing recovery time.
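To illustrate why recovery time depends on the number of backup copies examined, the following minimal sketch contrasts a sequential newest-first scan with a bisection over the copies. It is not SWEEPER's selection algorithm, which the abstract does not specify. The function names, the `is_clean` callback, and the assumption that every copy taken after the corruption point is also corrupt are all hypothetical and introduced only for this example.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sequential_scan(
    copies: Sequence[T], is_clean: Callable[[T], bool]
) -> Tuple[Optional[int], int]:
    """Brute force: test copies newest-first until a clean one is found.

    Returns (index of the latest clean copy or None, number of checks).
    """
    checks = 0
    for i in range(len(copies) - 1, -1, -1):
        checks += 1
        if is_clean(copies[i]):
            return i, checks
    return None, checks


def bisect_scan(
    copies: Sequence[T], is_clean: Callable[[T], bool]
) -> Tuple[Optional[int], int]:
    """Bisection for the latest clean copy.

    Assumption (hypothetical): copies are ordered oldest to newest and
    corruption persists, so every copy after the first corrupt one is
    also corrupt.
    """
    lo, hi = 0, len(copies) - 1
    best: Optional[int] = None
    checks = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_clean(copies[mid]):
            best = mid      # clean: the answer is here or later
            lo = mid + 1
        else:
            hi = mid - 1    # corrupt: the answer must be earlier
    return best, checks


if __name__ == "__main__":
    copies = list(range(100))            # stand-ins for 100 backup copies
    clean = lambda version: version < 37  # corruption introduced at copy 37
    print(sequential_scan(copies, clean))
    print(bisect_scan(copies, clean))
```

With 100 copies and corruption introduced at copy 37, the sequential scan performs 64 integrity checks while the bisection needs at most 7. Both return copy 36, the most recent clean version. If the persistence assumption does not hold, bisection may return a clean copy that is not the latest one.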
