Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

Data-parallel frameworks are attractive for large-scale data processing because they let applications process huge amounts of data on commodity machines with little effort. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and it has proven practical for such applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. The algorithm, called Scatter-Gather-Merge, is designed for general data-parallel frameworks and processes a star-join query in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge avoids a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
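The abstract does not spell out how Bloom filters cut I/O, so the following is only a single-machine sketch of the general idea, not the Scatter-Gather-Merge algorithm itself. It assumes one Bloom filter is built per dimension table from the keys that survive that dimension's predicates, and that fact rows failing any filter are dropped before the exact join. The names `BloomFilter`, `star_join`, and the sample columns (`datekey`, `partkey`, `revenue`) are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows against every dimension table.

    dimensions maps a foreign-key column name to {key: attributes},
    already restricted to the rows that satisfy the query predicates.
    Returns (candidates, joined) so the effect of prefiltering is visible.
    """
    # One filter per dimension, built from its surviving keys.
    filters = {}
    for fk, table in dimensions.items():
        bloom = BloomFilter()
        for key in table:
            bloom.add(key)
        filters[fk] = bloom

    # Cheap prefilter: drop fact rows that cannot match some dimension.
    candidates = [
        row for row in fact_rows
        if all(filters[fk].might_contain(row[fk]) for fk in filters)
    ]

    # Exact join on the survivors removes any Bloom false positives.
    joined = []
    for row in candidates:
        if all(row[fk] in dimensions[fk] for fk in dimensions):
            out = dict(row)
            for fk, table in dimensions.items():
                out.update(table[row[fk]])
            joined.append(out)
    return candidates, joined


# Hypothetical sample data: two dimensions and three fact rows.
dimensions = {
    "datekey": {1: {"year": 1997}},
    "partkey": {10: {"category": "MFGR#12"}},
}
fact_rows = [
    {"datekey": 1, "partkey": 10, "revenue": 100},
    {"datekey": 2, "partkey": 10, "revenue": 50},
    {"datekey": 1, "partkey": 11, "revenue": 70},
]
candidates, joined = star_join(fact_rows, dimensions)
```

In a distributed setting the filters would be small enough to ship to every worker scanning the fact table, which is where the I/O saving would come from; the exact join is still needed afterwards because a Bloom filter can admit rows that do not actually match.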
