Paper information - Tuple MapReduce and Pangool: an associated implementation

Tuple MapReduce and Pangool: an associated implementation

This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. The paper also presents Pangool, an open-source framework that implements Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance. Additionally, the paper provides pseudocode for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from industry users of Pangool; and a description of a distributed database that exploits Pangool. These results show that Tuple MapReduce can serve as a direct, better-suited replacement for the MapReduce model in current implementations, without modifying key system fundamentals.
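The abstract names compound records and sorting as needs that plain key/value MapReduce handles awkwardly. The following is a minimal in-memory sketch of the general idea, not Pangool's API and not the paper's pseudocode. It assumes that records are tuples with named fields (plain dicts here), that a job declares "group by" and "sort by" field lists, and that the group-by fields form a prefix of the sort-by fields. The names `tuple_map_reduce`, `parse_visit`, `first_visit`, and the sample log are hypothetical.

```python
from itertools import groupby


def tuple_map_reduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Run one tuple-based map/reduce pass in memory (illustrative only)."""
    # Assumption of this sketch: group-by fields are a prefix of sort-by
    # fields, so a single sort gives both grouping and in-group ordering.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    # Map phase: every input record yields zero or more tuples.
    tuples = [t for record in records for t in map_fn(record)]
    # Shuffle phase: order tuples by the full sort-by field list.
    tuples.sort(key=lambda t: tuple(t[f] for f in sort_by))
    # Reduce phase: one call per distinct combination of group-by values.
    output = []
    for key, group in groupby(tuples, key=lambda t: tuple(t[f] for f in group_by)):
        output.extend(reduce_fn(key, group))
    return output


def parse_visit(line):
    # Hypothetical input format: "user,day,url".
    user, day, url = line.split(",")
    yield {"user": user, "day": int(day), "url": url}


def first_visit(key, group):
    # Tuples arrive sorted by day within each user, so the first is the earliest.
    first = next(group)
    yield (key[0], first["day"], first["url"])


LOG = ["ann,3,/b", "bob,1,/a", "ann,1,/a", "bob,2,/c"]
RESULT = tuple_map_reduce(
    LOG, parse_visit, first_visit, group_by=["user"], sort_by=["user", "day"]
)
print(RESULT)
```

With plain key/value pairs, the same job would need a hand-built composite key plus a custom partitioner and grouping comparator; declaring the fields is what removes that boilerplate in this sketch.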

Jose Luis Fernandez-Marquez | Giovanna Di Marzo Serugendo | Eric Palacios | Pedro Ferrera | Ivan de Prado

[1] Ronald C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, 2010, BMC Bioinformatics.

[2] Robert L. Grossman, et al. Data mining using high performance data clouds: experimental studies using sector and sphere, 2008, KDD.

[3] Pete Wyckoff, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework, 2009, Proc. VLDB Endow.

[4] Chunming Rong, et al. Performance of Left Outer Join on Hadoop with Right Side within Single Node Memory Size, 2012, 2012 26th International Conference on Advanced Information Networking and Applications Workshops.

[5] Jose Luis Fernandez-Marquez, et al. Tuple MapReduce: Beyond Classic MapReduce, 2012, 2012 IEEE 12th International Conference on Data Mining.

[6] Kevin Wilkinson, et al. Data integration flows for business intelligence, 2009, EDBT '09.

[7] Benjamin Reed, et al. Building a high-level dataflow system on top of Map-Reduce, 2009, VLDB 2009.

[8] Hans-Wolfgang Loidl, et al. Comparing High Level MapReduce Query Languages, 2011, APPT.

[9] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[10] Rob Pike, et al. Interpreting the data: Parallel analysis with Sawzall, 2005, Sci. Program.

[11] Sanjay Ghemawat, et al. MapReduce: a flexible data processing tool, 2010, CACM.

[12] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[13] Andrey Balmin, et al. Jaql, 2011, Proc. VLDB Endow.

[14] M. Slee, et al. Thrift: Scalable Cross-Language Services Implementation, 2007.

[15] Hans-Wolfgang Loidl, et al. Improving the diagnosis of mild hypertrophic cardiomyopathy with MapReduce, 2012, MapReduce '12.

[16] Douglas Stott Parker, et al. Map-reduce-merge: simplified relational data processing on large clusters, 2007, SIGMOD '07.

[17] Craig Chambers, et al. FlumeJava: easy, efficient data-parallel pipelines, 2010, PLDI '10.

[18] Kunle Olukotun, et al. Map-Reduce for Machine Learning on Multicore, 2006, NIPS.