PowerHash: a hybrid grouping scheme by leveraging power-law properties of data

We study implementation schemes for GroupBy, an operation widely used in distributed systems and databases. GroupBy partitions a set of unordered records into groups. Because the data are often too large to fit in memory, many I/O-efficient grouping schemes that exploit external memory have been proposed. In this paper, we observe that the group sizes of many real datasets follow a power law, and that the performance of a grouping scheme varies considerably with group size: the indexing–filling approach favors data with large groups, while the partitioned hash approach favors data with small groups. Based on this observation, we propose a hybrid approach, PowerHash, which applies a different grouping scheme to each kind of data. Group sizes are estimated approximately with a count-min sketch so that large and small groups can be distinguished. Under a given memory budget, our results show that PowerHash can improve performance by up to six times over existing GroupBy implementations.
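To illustrate how a count-min sketch can separate large groups from small ones, the following is a minimal sketch and not the PowerHash implementation. The class `CountMinSketch`, the helper `split_by_group_size`, the two-pass structure, the `threshold` parameter, and the default `width` and `depth` are all assumptions made for illustration; the abstract does not specify them. A count-min sketch never underestimates a count, so a truly large group is always routed to the large side, while a small group may occasionally be misclassified as large because of hash collisions.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter (hypothetical helper, not from the paper)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(
            repr(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The minimum over rows is an upper bound on the true count.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


def split_by_group_size(records, key_of, threshold, width=1024, depth=4):
    """Route records to a 'big group' or 'small group' list.

    Assumes `records` can be iterated twice (for example, a list):
    the first pass builds the sketch, the second pass classifies.
    """
    sketch = CountMinSketch(width, depth)
    for record in records:
        sketch.add(key_of(record))

    big, small = [], []
    for record in records:
        if sketch.estimate(key_of(record)) >= threshold:
            big.append(record)
        else:
            small.append(record)
    return big, small


if __name__ == "__main__":
    # Skewed toy data: one heavy key and a few light ones.
    data = [("a", i) for i in range(100)] + [("b", 1), ("c", 2), ("c", 3)]
    big, small = split_by_group_size(data, key_of=lambda r: r[0], threshold=50)
    print(len(big), len(small))
```

In a hybrid scheme of the kind described above, the records in `big` would then go to the approach that favors large groups and those in `small` to the approach that favors small groups; how the threshold is chosen is left open here.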
