FAB: building distributed enterprise disk arrays from commodity components

This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays at lower cost and with better scalability. FAB is built from a collection of bricks: small storage appliances, each containing commodity disks, a CPU, NVRAM, and network interface cards. FAB uses a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks, and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85 MB/second of throughput for a database workload and 270 MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling, and it handles brick failures and recoveries smoothly without disturbing client requests.
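The general idea behind majority voting over replicas can be illustrated with a minimal sketch. This is not FAB's algorithm: it is a generic timestamped majority-quorum scheme, and the names `Replica`, `quorum_write`, and `quorum_read` are hypothetical. It assumes a single client and omits concurrency control, write-back on read, erasure coding, and reconfiguration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One brick's copy of a logical block (hypothetical structure)."""
    timestamp: int = 0
    value: bytes = b""
    alive: bool = True


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def quorum_write(replicas, timestamp, value):
    """Apply the write on reachable replicas that have not seen a newer
    timestamp; report success only if a majority accepted it."""
    acks = 0
    for r in replicas:
        if r.alive and timestamp > r.timestamp:
            r.timestamp, r.value = timestamp, value
            acks += 1
    return acks >= majority(len(replicas))


def quorum_read(replicas):
    """Return the newest value among reachable replicas, or None when
    fewer than a majority are reachable."""
    replies = [r for r in replicas if r.alive]
    if len(replies) < majority(len(replicas)):
        return None
    return max(replies, key=lambda r: r.timestamp).value
```

Because any two majorities of the same replica set share at least one member, a read that reaches a majority always sees the latest successfully written value, even if a different brick is down at read time than at write time.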
