Skew characteristics and their effects on parallel relational query processing

Database sizes are growing dramatically and queries are becoming increasingly complex, particularly in Decision Support Systems (DSS) and OnLine Analytic Processing (OLAP) systems, which have recently emerged as important database applications. In these systems performance is a critical issue, and speeding up the system has always been an objective, but the processing power of an individual processor can handle only a small fraction of current applications. As a result, parallel processing is exploited to improve database system performance. In this thesis we focus on relational database systems and study skew characteristics and their effects on parallel query processing.
