An Online Controller Towards Self-Adaptive File System Availability and Performance

Building a large-scale distributed file system that simultaneously maintains high availability and high performance remains a significant challenge. Although many fault-tolerance techniques have been proposed and adopted in both commercial and academic distributed file systems to achieve high availability, most of them sacrifice performance for higher system availability. Moreover, recent studies show that both availability and performance depend on the system workload. In this paper, we analyze the correlations among availability, performance, and workload under a replication strategy, and we discuss the trade-off between availability and performance across different workloads. Our analysis leads to the design of an online controller that can dynamically achieve optimal performance and availability by tuning the system's replication policy.
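The abstract does not spell out the controller's internals, but the control loop it implies can be sketched as a periodic feedback cycle: sample availability and workload metrics, then nudge the replication factor up or down to balance the two goals. The sketch below is a hypothetical illustration of that idea, not the paper's algorithm; all class names, metric fields, and thresholds (e.g. `Metrics`, `ReplicationController`, `target_availability`) are assumptions introduced here.

```python
# A minimal sketch of an online controller that tunes a replication factor
# from observed availability and workload. All names and thresholds are
# hypothetical; the paper's actual controller may differ substantially.

from dataclasses import dataclass


@dataclass
class Metrics:
    """One monitoring sample (hypothetical fields)."""
    availability: float      # measured fraction of successful requests, 0..1
    write_ratio: float       # fraction of write requests in the workload, 0..1
    avg_latency_ms: float    # observed average request latency


class ReplicationController:
    """Adjusts the replication factor by one step each control period."""

    def __init__(self, target_availability=0.999, latency_budget_ms=50.0,
                 min_replicas=2, max_replicas=5):
        self.target_availability = target_availability
        self.latency_budget_ms = latency_budget_ms
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.replicas = min_replicas

    def step(self, m: Metrics) -> int:
        """Return the replication factor to use for the next period."""
        if m.availability < self.target_availability:
            # Availability below target: add a replica if allowed.
            self.replicas = min(self.replicas + 1, self.max_replicas)
        elif m.avg_latency_ms > self.latency_budget_ms and m.write_ratio > 0.5:
            # Write-heavy workload paying too much for replication:
            # drop a replica to recover performance.
            self.replicas = max(self.replicas - 1, self.min_replicas)
        return self.replicas


if __name__ == "__main__":
    controller = ReplicationController()
    samples = [
        Metrics(availability=0.995, write_ratio=0.2, avg_latency_ms=30.0),
        Metrics(availability=0.9995, write_ratio=0.7, avg_latency_ms=80.0),
    ]
    for sample in samples:
        print("next replication factor:", controller.step(sample))
```

In a real deployment the control period, step size, and metric sources would of course have to be tuned to the workload, and the decision rule could be replaced by a model-driven or control-theoretic policy as the paper suggests.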
