Tuple MapReduce and Pangool: an associated implementation

This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. The paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance. Additionally, the paper provides pseudocode for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from industry users of Pangool; and a description of a distributed database that exploits Pangool. These results show that Tuple MapReduce can serve as a direct, better-suited replacement for the MapReduce model in current implementations without the need to modify key system fundamentals.
