Data Availability and Durability with the Hadoop Distributed File System

The Hadoop Distributed File System at Yahoo! stores 40 petabytes of application data across 30,000 nodes. The most conventional strategy for data protection, simply making a copy somewhere else, is not practical for data sets of this size. To be a good custodian of this much data, HDFS must continuously manage the number of replicas for each block, test the integrity of blocks, balance the usage of resources as the hardware infrastructure changes, report status to administrators, and be on guard for the unexpected. Furthermore, the system administrators must ensure that thousands of hosts are operational, have network connectivity, and are executing the HDFS NameNode and DataNode applications. How well has this worked in practice? HDFS is (almost) always available and (almost) never loses data.
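As a rough illustration of the replica bookkeeping described above, the sketch below uses the standard Hadoop FileSystem API to report, for one file, the requested replication factor and the DataNodes currently holding each block replica. The path /data/example.txt is a hypothetical placeholder, and the default Configuration assumes the cluster's core-site.xml and hdfs-site.xml are on the classpath; none of this is code from the article itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster described by the Hadoop config files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; substitute any HDFS path of interest.
            Path file = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(file);

            // The replication factor requested for this file when it was created (or later changed).
            System.out.println("Requested replication: " + status.getReplication());

            // For each block, list the DataNodes currently reported as holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("Block at offset %d: %d replicas on %s%n",
                        block.getOffset(),
                        block.getHosts().length,
                        String.join(", ", block.getHosts()));
            }

            fs.close();
        }
    }

Operationally, administrators can get comparable per-block information from the command line with hdfs fsck <path> -files -blocks -locations, which is how under- or over-replicated blocks are typically spotted during routine checks.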