Analytical Review on Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is used for storing, processing, and analyzing very large amounts of unstructured data. It stores data reliably and provides fault-tolerant, fast, and scalable access to the information. It is used with MapReduce, a programming model; HDFS and MapReduce are the core components of Hadoop, a framework of tools for large-scale computation and processing of large data sets. Because data and information are growing exponentially in the current era, technologies such as Hadoop and the Cassandra File System have become a preferred choice among IT professionals and business communities. HDFS is growing rapidly and proving itself as a cutting-edge technology for dealing with huge amounts of structured and unstructured data. This paper gives a step-by-step introduction to data management using file systems and data management using RDBMS, then discusses the need for the Hadoop Distributed File System and how it works.
