Data Analysis using Mapper and Reducer with Optimal Configuration in Hadoop

Data analysis is an important function of cloud computing, allowing huge amounts of data to be processed over very large clusters. Hadoop is a software framework for large-scale data analysis. It provides the Hadoop Distributed File System (HDFS) for storage, while the analysis and transformation of very large data sets is performed using the MapReduce paradigm. MapReduce, a programming model widely used for processing large data sets, is a popular way to handle data in cloud environments because of its excellent scalability and good fault tolerance, and HDFS is designed to stream those data sets. The Hadoop MapReduce system was often unfair in its allocation, and a dramatic improvement is achieved through the Elastic Mapper Reducer System. The proposed Mapper Reducer function analyzes the data set and achieves better job execution performance by using an optimal configuration of mappers and reducers based on the size of the data set. It also lets users view the status of a job and localize errors in scheduled jobs. This makes efficient use of the performance properties of optimized scheduled jobs. As a result, the system achieves substantially lower system cost, energy usage, and management complexity, along with higher performance.

General Terms: Data analysis, Hadoop, HDFS, MapReduce Paradigm

Keywords: Cloud Computing, Hadoop Distributed File System, Performance Paradigm.
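
The idea of choosing mapper and reducer counts from the size of the data set can be illustrated with a rough sketch. It is not the method proposed in the paper, whose details are not given here. It assumes a single splittable input file, one map task per input split, a split size equal to a 128 MiB block (older Hadoop releases default to 64 MiB), and the general Hadoop guideline of about 0.95 or 1.75 reduce tasks per available container. The function names and parameters are hypothetical.

```python
import math

MIB = 1024 * 1024


def estimate_map_tasks(input_bytes, split_bytes=128 * MIB):
    """Estimate map tasks as one per input split (assumed: one splittable file)."""
    if input_bytes <= 0:
        return 0
    return math.ceil(input_bytes / split_bytes)


def estimate_reduce_tasks(nodes, containers_per_node, factor=0.95):
    """Estimate reduce tasks from cluster capacity.

    factor=0.95 lets all reduces start in a single wave; factor=1.75 gives
    faster nodes a second wave for better load balancing.
    """
    slots = nodes * containers_per_node
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, round(factor * slots))


if __name__ == "__main__":
    size = 10 * 1024 * MIB  # a hypothetical 10 GiB input
    print("map tasks:", estimate_map_tasks(size))
    print("reduce tasks:", estimate_reduce_tasks(nodes=10, containers_per_node=4))
```

In practice the map count follows from the input splits rather than being set directly, while a reducer count estimated this way could be supplied to a job through a setting such as `mapreduce.job.reduces`.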
