Fatman: Cost-saving and reliable archival storage based on volunteer resources

We present Fatman, an enterprise-scale archival storage based on volunteer contribution resources from underutilized web servers, usually deployed on thousands of nodes with spare storage capacity. Fatman is specifically designed for enhancing the utilization of existing storage resources and cutting down the hardware purchase cost. Two major concerned issues of the system design are maximizing the resource utilization of volunteer nodes without violating Service Level Objectives (SLOs) and minimizing the cost without reducing the availability of archival system. Fatman has been widely deployed on tens of thousands of server nodes across several datacenters, provided more than 100PB storage capacity and served dozens of internal mass-data applications. The system realizes an efficient storage quota consolidation by strong isolation and budget limitation, to maximally support resources contribution without any degradation on host-level SLOs. It firstly improves data reliability by applying disk failure prediction to minish failure recovery cost, named fault-aware data management, dramatically reduces the MTTR by 76.3% and decreases file crash ratio by 35% on real-life product workload.

[1]  Vijay S. Pande,et al.  Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problem , 2009, 0901.0866.

[2]  Michael Vrable,et al.  BlueSky: a cloud-backed file system for the enterprise , 2012, FAST.

[3]  GhemawatSanjay,et al.  The Google file system , 2003 .

[4]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[5]  R. S. Fabry,et al.  A fast file system for UNIX , 1984, TOCS.

[6]  Michael Vrable,et al.  Cumulus: Filesystem backup to the cloud , 2009, TOS.

[7]  Joseph F. Murray,et al.  Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application , 2005, J. Mach. Learn. Res..

[8]  Ethan L. Miller,et al.  Disk infant mortality in large storage systems , 2005, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[9]  Nouman M. Durrani,et al.  Volunteer computing: requirements, challenges, and solutions , 2014, J. Netw. Comput. Appl..

[10]  Mark D. Corner,et al.  TFS: A Transparent File System for Contributory Storage , 2007, FAST.

[11]  Dimitris S. Papailiopoulos,et al.  Optimal locally repairable codes and connections to matroid theory , 2013, 2013 IEEE International Symposium on Information Theory.

[12]  Dina Bitton,et al.  Disk Shadowing , 1988, VLDB.

[13]  Dimitris S. Papailiopoulos,et al.  XORing Elephants: Novel Erasure Codes for Big Data , 2013, Proc. VLDB Endow..

[14]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[15]  Larry L. Peterson,et al.  Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors , 2007, EuroSys '07.

[16]  James S. Plank,et al.  A tutorial on Reed–Solomon coding for fault‐tolerance in RAID‐like systems , 1997, Softw. Pract. Exp..

[17]  Rodrigo Rodrigues,et al.  High Availability in DHTs: Erasure Coding vs. Replication , 2005, IPTPS.

[18]  Arkady Kanevsky,et al.  Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics , 2008, TOS.

[19]  Cheng Huang,et al.  Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads , 2012, FAST.

[20]  Kashi Venkatesh Vishwanath,et al.  Characterizing cloud computing hardware reliability , 2010, SoCC '10.

[21]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[22]  Gang Wang,et al.  Proactive drive failure prediction for large scale storage systems , 2013, 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST).

[23]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[24]  Jehan-Francois Pâris EVALUATING THE RELIABILITY OF STORAGE SYSTEMS , 2006 .

[25]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[26]  Garth A. Gibson,et al.  RAID: high-performance, reliable secondary storage , 1994, CSUR.