The Family of Map-Reduce

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications that can process vast amounts of data on large clusters of commodity machines. MapReduce isolates the application from the details of running a distributed program, such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in following up work. This chapter provides a comprehensive survey for a family of approaches and mechanisms of large scale data analysis that have been implemented based on the original father idea of the MapReduce framework, and are currently gaining a lot of momentum in both research and industrial communities. Some case studies are discussed as well.

[1]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[2]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[3]  Randy H. Katz,et al.  Above the Clouds: A Berkeley View of Cloud Computing , 2009 .

[4]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[5]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[6]  Vinay Setty,et al.  Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing) , 2010, Proc. VLDB Endow..

[7]  Yuanyuan Tian,et al.  CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop , 2011, Proc. VLDB Endow..

[8]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[9]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[10]  Daniel M. Batista,et al.  A Survey of Large Scale Data Management Approaches in Cloud Environments , 2011, IEEE Communications Surveys & Tutorials.

[11]  Bingsheng He,et al.  Large graph processing in the cloud , 2010, SIGMOD Conference.

[12]  Abraham Silberschatz,et al.  Efficient processing of data warehousing queries in a split execution environment , 2011, SIGMOD '11.

[13]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[14]  John Cieslewicz,et al.  SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions , 2009, Proc. VLDB Endow..

[15]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[16]  Michael Stonebraker,et al.  The Case for Shared Nothing , 1985, HPTS.

[17]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[18]  Yu Xu,et al.  Integrating hadoop and parallel DBMs , 2010, SIGMOD Conference.

[19]  Howard Gobioff,et al.  The Google file system , 2003, SOSP '03.

[20]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[21]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[22]  Jianmin Wang,et al.  MapDupReducer: detecting near duplicates over massive datasets , 2010, SIGMOD Conference.

[23]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[24]  Michael D. Ernst,et al.  HaLoop , 2010, Proc. VLDB Endow..

[25]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[26]  Peter J. Haas,et al.  Ricardo: integrating R and Hadoop , 2010, SIGMOD Conference.

[27]  Michael Stonebraker,et al.  MapReduce and parallel DBMSs: friends or foes? , 2010, CACM.

[28]  Douglas Stott Parker,et al.  Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters , 2009, DASFAA.

[29]  Alexander S. Szalay,et al.  Petascale computational systems , 2007, Computer.

[30]  Aristides Gionis,et al.  Social Content Matching in MapReduce , 2011, Proc. VLDB Endow..

[31]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[32]  Andrey Balmin,et al.  Jaql , 2011, Proc. VLDB Endow..

[33]  Abraham Silberschatz,et al.  HadoopDB in action: building real world applications , 2010, SIGMOD Conference.

[34]  George Kollios,et al.  MRShare , 2010, Proc. VLDB Endow..

[35]  Steven Hand,et al.  Scripting the Cloud with Skywriting , 2010, HotCloud.

[36]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[37]  Rob Pike,et al.  Interpreting the data: Parallel analysis with Sawzall , 2005, Sci. Program..

[38]  Jignesh M. Patel,et al.  Energy management for MapReduce clusters , 2010, Proc. VLDB Endow..

[39]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..

[40]  Robert L. Grossman,et al.  Lessons learned from a year's worth of benchmarks of large data clouds , 2009, MTAGS '09.

[41]  Padmashree Ravindra,et al.  Towards scalable RDF graph analytics on MapReduce , 2010, MDAC '10.

[42]  Naphtali Rishe,et al.  Experiences on Processing Spatial Data with MapReduce , 2009, SSDBM.

[43]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[44]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[45]  Jingren Zhou,et al.  Incorporating partitioning and parallel plans into the SCOPE optimizer , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[46]  Tony Hey,et al.  The Fourth Paradigm: Data-Intensive Scientific Discovery , 2009 .

[47]  Silvio Lattanzi,et al.  Filtering: a method for solving graph problems in MapReduce , 2011, SPAA '11.

[48]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..