HIP: Information Passing for Optimizing Join-Intensive Data Processing Workloads on Hadoop

Hadoop-based data processing platforms translate join-intensive queries into workflows of multiple "jobs" (MapReduce cycles). Such multi-job workflows move significant amounts of data through the disk, network, and memory fabric of a Hadoop cluster, which can negatively impact performance and scalability. Consequently, techniques that minimize the size of intermediate results are valuable in this context. In this paper, we present an information passing technique (HIP) that minimizes the size of intermediate data on Hadoop-based data processing platforms.