Optimizing Multi-join in Cloud Environment

In cloud computing, complex data analysis usually requires access to multiple data sets. Existing MapReduce-based multi-join mechanisms join multiple data sets by a cascade method, which is flexible but inefficient. This paper analyzes existing concurrent join models and proposes a Hierarchized Multi-Join model based on a two-dimensional Reducer matrix (TD-HMJ). TD-HMJ processes all the "key" attributes in a single Map phase and divides the tables to be joined into groups of two or three tables each. In the Reduce phase, the tables within each group are joined simultaneously by means of a two-dimensional Reducer matrix. TD-HMJ then completes the joins between groups through multiple Reduce passes. Theoretical analysis and experimental results show that TD-HMJ reduces data transmission, shortens multi-join time, and improves system efficiency.
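
The abstract does not spell out how tuples are routed to the Reducer matrix, so the following is only a minimal sketch of the general idea for one group of three tables, not the TD-HMJ implementation itself. It assumes a chain join R(a, b) ⋈ S(b, c) ⋈ T(c, d) and simulates the Map and Reduce phases in a single process. The function name `matrix_join`, the table schemas, and the `rows`/`cols` parameters are all hypothetical. The middle table, which carries both join keys, is sent to exactly one cell of the matrix, while each outer table is replicated along the one dimension whose key it does not know.

```python
from collections import defaultdict
from itertools import product


def matrix_join(R, S, T, rows=2, cols=2):
    """Join R(a, b), S(b, c), T(c, d) in one simulated Map/Reduce round.

    Illustrative only: the schemas and the routing rule are assumptions.
    """
    # Map phase: route every tuple to cells of a rows x cols reducer matrix.
    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:
        # R only carries key b, so it is replicated across row hash(b).
        for j in range(cols):
            cells[(hash(b) % rows, j)]["R"].append((a, b))
    for b, c in S:
        # S carries both keys, so it goes to exactly one cell.
        cells[(hash(b) % rows, hash(c) % cols)]["S"].append((b, c))
    for c, d in T:
        # T only carries key c, so it is replicated down column hash(c).
        for i in range(rows):
            cells[(i, hash(c) % cols)]["T"].append((c, d))

    # Reduce phase: each cell joins its local fragments independently.
    out = []
    for cell in cells.values():
        for (a, b), (b2, c), (c2, d) in product(cell["R"], cell["S"], cell["T"]):
            if b == b2 and c == c2:
                out.append((a, b, c, d))
    return sorted(out)
```

Because every S tuple lands in exactly one cell, each joined row is produced exactly once (given duplicate-free inputs), and no intermediate two-table result has to be written out and re-read as it would be in a cascade of joins. The cost is the replication of R and T, which grows with `cols` and `rows` respectively.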
