Optimization Factor Analysis of Large-Scale Join Queries on Different Platforms

Popular big data computing platforms, such as Spark, provide a new computing paradigm for traditional database operations, such as queries. Beyond their ability to manage large-scale data, big data platforms have earned a reputation for simple programming interfaces and good scale-out performance. Traditional databases, however, have intrinsic optimization mechanisms for fundamental operators, which support efficient and flexible data processing. A comprehensive comparison of the data processing performance of these two kinds of platforms is therefore valuable. In this paper, we focus on the join operation, a primary and frequently used operator in both databases and big data analysis. We design and conduct extensive experiments to test the performance of the two classic platforms under unified datasets and hardware, revealing how factors such as computing schema and storage media influence performance. Based on the experimental analysis, we also put forward advice on choosing a computing platform for different application scenarios.
