A New HDFS Structure Model to Evaluate the Performance of the Word Count Application on Different File Sizes

MapReduce is a powerful distributed processing model for large datasets. Hadoop is an open-source framework and implementation of MapReduce. The Hadoop Distributed File System (HDFS) has become very popular for building large-scale, high-performance distributed data processing systems. HDFS is designed mainly to handle large files, so processing massive numbers of small files is a challenge for native HDFS. This paper introduces an approach to optimize the performance of processing massive numbers of small files on HDFS. We design a new HDFS structure model whose main idea is to merge small files at the source and write them directly into a merged file. Experimental results show that the proposed scheme effectively improves the storage and access efficiency of massive small files on HDFS.

Keywords: MapReduce, HDFS, Big data, Cluster
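To illustrate the client-side merging idea described above, the following is a minimal sketch using the standard Hadoop Java FileSystem API; the class name SmallFileMerger, the directory layout, and the merge granularity are illustrative assumptions, not the paper's exact implementation.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: merge small local files into one HDFS file at the source,
// so the NameNode tracks a single large file instead of many small ones.
public class SmallFileMerger {

    public static void mergeToHdfs(File localDir, String hdfsTarget) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One output stream on HDFS; every small file is appended into it.
        try (FSDataOutputStream out = fs.create(new Path(hdfsTarget))) {
            File[] smallFiles = localDir.listFiles();
            if (smallFiles == null) {
                return;
            }
            for (File small : smallFiles) {
                if (!small.isFile()) {
                    continue;
                }
                try (FileInputStream in = new FileInputStream(small)) {
                    // Copy the small file's bytes into the merged HDFS file.
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        } finally {
            fs.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): hadoop jar merger.jar SmallFileMerger /local/small-files /data/merged.dat
        mergeToHdfs(new File(args[0]), args[1]);
    }
}

In practice, a merge scheme like this would also record an index of each small file's offset and length inside the merged file so that individual records remain addressable; that bookkeeping is omitted here for brevity.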