Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop can be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.
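To make the contrast between scan-oriented and index-based access concrete, here is a minimal Python sketch. It is not the Hadoop++ implementation and does not use any Hadoop API. The record layout, the `scan_select` and `index_select` helpers, and the idea of keeping a sorted key list per split are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical split: (key, payload) records, sorted by key at load time.
records = sorted((k, f"row-{k}") for k in [42, 7, 19, 88, 3, 61, 19])
# Illustrative "index": the sorted keys of this split.
keys = [k for k, _ in records]


def scan_select(split, lo, hi):
    """Scan-oriented access: every record is read and tested."""
    return [rec for rec in split if lo <= rec[0] <= hi]


def index_select(split, sorted_keys, lo, hi):
    """Index access: binary search bounds the range, then only matches are read."""
    start = bisect_left(sorted_keys, lo)
    end = bisect_right(sorted_keys, hi)
    return split[start:end]


# Both access paths return the same records; only the work done differs.
assert scan_select(records, 10, 50) == index_select(records, keys, 10, 50)
```

The scan touches every record in the split, while the indexed lookup does two binary searches and then reads only the qualifying range. That difference is the kind of gap the abstract attributes to scan-oriented processing.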

Vinay Setty | Jorge-Arnulfo Quiané-Ruiz | Alekh Jindal | Jens Dittrich | Yagiz Kargin | Jörg Schad
