Paper Information - Efficient Star Join for Column-oriented Data Store in the MapReduce Environment

MapReduce is a parallel computing paradigm that has gained a lot of attention from both industry and academia in recent years. Unlike parallel DBMSs, MapReduce makes it easier for non-experts to develop scalable parallel programs for analytical applications over huge data sets across clusters of commodity machines. Because its processing is scan-oriented, MapReduce inevitably accesses many unnecessary tuples when evaluating relational operators, especially table joins, which leaves substantial room for improving its performance. In this paper, we propose an efficient star join strategy for column-oriented data stores, called HdBmp join, which uses a three-level content-aware index (the HdBmp index). With this index, most of the unnecessary tuples can be filtered out during join processing, resulting in a large reduction in both communication cost and execution time. Our extensive experimental studies confirm the efficiency, scalability, and effectiveness of the newly proposed join method.
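
The abstract does not describe how the three-level HdBmp index is built, so the following is only a minimal sketch of the general idea it relies on: filtering fact-table rows with bitmaps before the join. It is a generic, single-machine bitmap-filtered star join over a column-oriented fact table, not the paper's HdBmp join. All names (`matching_keys`, `build_bitmap`, `star_join`) and the sample tables are hypothetical.

```python
# Illustrative sketch only: a generic bitmap-filtered star join.
# This is NOT the paper's HdBmp index or join algorithm.

def matching_keys(dimension, predicate):
    """Return the set of dimension keys whose row satisfies the predicate."""
    return {key for key, row in dimension.items() if predicate(row)}

def build_bitmap(fk_column, keys):
    """Return an int bitmap with bit i set when fact row i references a key in `keys`."""
    bitmap = 0
    for i, fk in enumerate(fk_column):
        if fk in keys:
            bitmap |= 1 << i
    return bitmap

def star_join(fact, dimensions, predicates):
    """Join a column-oriented fact table with its dimension tables.

    fact:       dict mapping column name -> list of values (one list per column)
    dimensions: dict mapping foreign-key column name -> {key: row dict}
    predicates: dict mapping foreign-key column name -> predicate on a dimension row
    """
    n = len(next(iter(fact.values())))
    survivors = (1 << n) - 1  # start with every fact row selected
    for fk_col, dim in dimensions.items():
        keys = matching_keys(dim, predicates[fk_col])
        # AND the per-dimension bitmaps so only rows passing all predicates remain.
        survivors &= build_bitmap(fact[fk_col], keys)

    result = []
    for i in range(n):
        if (survivors >> i) & 1:
            # Only surviving rows are materialized and joined.
            row = {col: values[i] for col, values in fact.items()}
            for fk_col, dim in dimensions.items():
                row.update(dim[fact[fk_col][i]])
            result.append(row)
    return result

# Hypothetical sample data: one fact table and two dimension tables.
fact = {
    "date_id": [1, 2, 1, 3],
    "store_id": [10, 10, 20, 20],
    "revenue": [5, 7, 11, 13],
}
dimensions = {
    "date_id": {1: {"year": 2010}, 2: {"year": 2011}, 3: {"year": 2010}},
    "store_id": {10: {"region": "east"}, 20: {"region": "west"}},
}
predicates = {
    "date_id": lambda row: row["year"] == 2010,
    "store_id": lambda row: row["region"] == "west",
}

result = star_join(fact, dimensions, predicates)
```

In this sketch only the two fact rows that pass both dimension predicates are ever materialized. In a distributed setting, that kind of early filtering is what would reduce the data shuffled between map and reduce tasks.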

Aoying Zhou | Fan Xia | Minqi Zhou | Haitong Zhu
