Handling Big Data Efficiently by Using Map Reduce Technique

Extremely large amount of data is being captured by today's organizations and is continue to increase. It becomes computationally inefficient to analyze such huge data. Researchers has addressed problem in discovering knowledge from these continuously growing large data sets. Quantity of available raw data has been increasing at a very high rate. The precious information is concealed in large databases. Data mining has become an interesting area to extract the embedded precious information from them. For many years it has been found its root in all kinds of application areas. Thus, gave evolution to many data mining methods which started to get applied in several real life fields. But not all the methods possess the capability to deal with and handle the huge collection of data. In recent years, numbers of computation and data intensive scientific data analyses are established. To perform the large scale data mining analyses so as to meet the scalability and performance requirements of big data, several efficient parallel and concurrent algorithms got applied. A lot of parallel algorithms are put into action using different parallelization techniques, such as-threads, MPI, MapReduce etc. Which yield different performance and usability characteristics. The MPI model works efficiently in computing rigorous problems but it is a complicated task to bring this model into the practical use. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. It is inspired by functional programming which allows expressing distributed computations on massive amounts of data. It is designed for large-scale data processing as it allows to run on clusters of commodity hardware. A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. In this paper, we are going to work around MapReduce, its advantages, disadvantages and how it can be used in integration with other technology.

[1]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[2]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[3]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[4]  Magdalena Balazinska,et al.  Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[5]  Rajeev Gandhi,et al.  An Analysis of Traces from a Production MapReduce Cluster , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[6]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[7]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[8]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[9]  Yon Dohn Chung,et al.  Parallel data processing with MapReduce: a survey , 2012, SGMD.

[10]  Michael Stonebraker,et al.  "One Size Fits All": An Idea Whose Time Has Come and Gone (Abstract) , 2005, ICDE.

[11]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[12]  Michael Stonebraker,et al.  MapReduce and parallel DBMSs: friends or foes? , 2010, CACM.

[13]  D. DeWitt MapReduce: A major step backwards | The Database Column , 2011 .

[14]  Daniela Florescu,et al.  Rethinking cost and performance of database systems , 2009, SGMD.

[15]  Henrik Loeser,et al.  "One Size Fits All": An Idea Whose Time Has Come and Gone? , 2011, BTW.

[16]  Christoforos E. Kozyrakis,et al.  Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[17]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[18]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[19]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[20]  Michael Stonebraker,et al.  MapReduce: A major step backwards , 2014 .

[21]  Benjamin Rose,et al.  Supporting MapReduce on large-scale asymmetric multi-core clusters , 2009, OPSR.

[22]  Michael Stonebraker,et al.  One Size Fits All? - Part 2: Benchmarking Results , 2007 .

[23]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[24]  Kyuseok Shim,et al.  MapReduce Algorithms for Big Data Analysis , 2013, DNIS.

[25]  GhemawatSanjay,et al.  The Google file system , 2003 .

[26]  Magdalena Balazinska,et al.  ParaTimer: a progress indicator for MapReduce DAGs , 2010, SIGMOD Conference.

[27]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..