A data locality optimization algorithm for large-scale data processing in Hadoop

Data-intensive applications are increasingly designed to execute on large computing clusters. Our previous observation on Tencent production systems has indicated that join query is one of the most important queries in large-scale data processing. When running a join query on Hive system, the job of the join query is divided into map phase and reduce phase, and requires transferring large amounts of intermediate results over the network, which is inefficient. In this paper, we proposed an algorithm called CHMJ, the general idea of the algorithm is to take advantage of data locality to accelerate calculation. It includes four parts, Data distribution strategy, Parallel HashMapJoin Algorithm, CoLocation Scheduling and Delay scheduling strategy. CHMJ has been adopted in Tencent data warehouse, and plays an important role in Tencent's daily operations. Our relevant experiments demonstrate the feasibility and efficiency of our solution.

[1]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[2]  Kenneth A. Ross,et al.  Cache Conscious Indexing for Decision-Support in Main Memory , 1999, VLDB.

[3]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[4]  David R. Karger,et al.  Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web , 1997, STOC '97.

[5]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[6]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[7]  Martin L. Kersten,et al.  Database Architecture Optimized for the New Bottleneck: Memory Access , 1999, VLDB.

[8]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[9]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[10]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[11]  Jeffrey F. Naughton,et al.  Cache Conscious Algorithms for Relational Query Processing , 1994, VLDB.

[12]  Werner Vogels,et al.  Dynamo: amazon's highly available key-value store , 2007, SOSP.

[13]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[14]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[15]  David J. DeWitt,et al.  A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment , 1989, SIGMOD '89.

[16]  Elke A. Rundensteiner,et al.  Revisiting the Role of Pipelined Parallelism in Multi-Join Query Processing , 2005 .

[17]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[18]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).