The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the combined storage and compute capacity can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
