论文信息 - BIG DATA - IMPORTANCE OF HADOOP DISTRIBUTED FILESYSTEM

BIG DATA - IMPORTANCE OF HADOOP DISTRIBUTED FILESYSTEM

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes.* A zettabyte is 1021 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. This flood of data is coming from many sources as mentioned below - • The New York Stock Exchange generates about one terabyte of new trade data perday. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

Ankur Chaudhary | Pankaj Singh

[1] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[2] Jingren Zhou,et al. SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[3] Carlos Maltzahn,et al. Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[4] GhemawatSanjay,et al. The Google file system , 2003 .

[5] H. Garcia-Molina,et al. Scheduling I/O requests with deadlines: A performance evaluation , 1990, [1990] Proceedings 11th Real-Time Systems Symposium.