A Performance Analysis of MapReduce Task with Large Number of Files Dataset in Big Data Using Hadoop

Big Data refers to volumes of data too large to be managed by traditional data management systems. Hadoop is a technological answer to Big Data: the Hadoop Distributed File System (HDFS) and the MapReduce programming model are used to store the data and retrieve information from it. Files of terabyte scale can be stored on HDFS and analyzed with MapReduce. This paper introduces Hadoop HDFS and MapReduce for storing a large number of files and retrieving information from them. We present our experimental work on Hadoop, in which we supply varying numbers of files as input to the system and analyze its performance. We study the number of bytes written and read by the file system and by MapReduce, and we analyze the behavior of the map and reduce tasks, and the bytes they write and read, as the number of input files grows.
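For context, the sketch below shows a minimal Hadoop MapReduce job written against the standard org.apache.hadoop.mapreduce Java API. It is the classic word-count example, not the authors' own experimental code; the WordCount class name and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: for each input line, emit (word, 1) per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: sum the per-word counts emitted by the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

When such a job completes, Hadoop reports file system counters (bytes read and written on the local file system and on HDFS) in the job output; these are the kind of quantities the measurements in this paper appear to be based on.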
