THE HADOOP DISTRIBUTED FILE SYSTEM: BALANCING PORTABILITY

Hadoop is a software framework that supports data-intensive distributed applications. Hadoop creates clusters of machines and coordinates the work among them. It includes two major components: HDFS (Hadoop Distributed File System) and MapReduce. HDFS is designed to store large amounts of data reliably and to provide high availability of that data to user applications running at clients. It splits files into data blocks and stores each block redundantly across a pool of servers, enabling reliable and extremely rapid computation. MapReduce is a software framework for analyzing and transforming very large data sets into a desired output. This paper focuses on how replicas are managed in HDFS to provide high availability of data under extreme computational requirements. It then examines the possible failures that can affect a Hadoop cluster and the failover mechanisms that can be deployed to protect it.
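To make the replication mechanism described above concrete, the following minimal sketch (not taken from the paper) uses the standard Hadoop Java client API to write a file with a given replication factor and later raise it; the file path and replication values are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);   // default replication for newly created files

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt");   // hypothetical path

            // Write a file: HDFS splits it into blocks and stores each block
            // on multiple DataNodes according to the replication factor.
            try (FSDataOutputStream out = fs.create(file, (short) 3)) {
                out.writeUTF("replicated block example");
            }

            // Raise replication later, e.g. for a file read by many concurrent jobs;
            // the NameNode schedules additional block copies in the background.
            fs.setReplication(file, (short) 5);
            fs.close();
        }
    }

A higher replication factor improves availability and read bandwidth at the cost of storage, which is the trade-off the replica-management discussion in this paper revolves around.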
