Request Window: an Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries

This paper addresses the problem of improving distributed query throughput in an RDBMS-based data integration system, which inherits the query execution model of the underlying RDBMS: each query is executed independently, and a global buffer pool provides disk page sharing across concurrent query execution processes. This model is ill-suited to concurrent distributed queries because its foundation, the memory-disk hierarchy, does not exist for data provided by remote sources. As a result, the query engine cannot exploit any data sharing, and each process must interact with the data sources independently, issuing its own data requests and fetching its own data over the network. This paper presents Request Window, a novel distributed query processing (DQP) mechanism that detects and exploits data sharing opportunities across concurrent distributed queries. By combining multiple similar data requests issued to the same data source into a common data request, Request Window allows concurrent query execution processes to share the common result data. Because this reduces both the load on data sources and the volume of data transferred, the throughput of the query engine can be significantly improved. This paper also introduces the IGNITE system, an extension of PostgreSQL with DQP support. Our experimental results show that Request Window enables IGNITE to achieve a 1.7x speedup over a commercial data integration system when running a workload of distributed TPC-H queries.
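The idea of combining similar requests to one source can be illustrated with a minimal Python sketch. It is not the IGNITE implementation: the abstract does not say how similar requests are detected or merged, so the sketch assumes that requests are similar when they target the same source and table, and that they are merged into one covering key range. The names `DataRequest`, `combine_window`, `execute_window`, and `fake_fetch` are hypothetical, as is the integer `key` column.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    query_id: str   # which concurrent query issued the request
    source: str     # remote data source name
    table: str      # remote table name
    low: int        # inclusive lower bound on a hypothetical integer key
    high: int       # inclusive upper bound on that key


def combine_window(requests):
    """Merge requests that hit the same (source, table) into one covering range."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req.source, req.table)].append(req)
    return {
        target: (min(r.low for r in group), max(r.high for r in group), group)
        for target, group in groups.items()
    }


def execute_window(requests, fetch):
    """Issue one fetch per combined request, then split the rows back per query."""
    results = {}
    for (source, table), (low, high, group) in combine_window(requests).items():
        shared_rows = fetch(source, table, low, high)  # one request to the source
        for req in group:
            # Each query keeps only the rows its original request asked for.
            results[req.query_id] = [
                row for row in shared_rows if req.low <= row["key"] <= req.high
            ]
    return results


# Toy stand-in for a remote source (an assumption); it records every request it receives.
fetch_calls = []


def fake_fetch(source, table, low, high):
    fetch_calls.append((source, table, low, high))
    return [{"key": k, "value": k * 10} for k in range(low, high + 1)]


window = [
    DataRequest("q1", "src_a", "orders", 1, 5),
    DataRequest("q2", "src_a", "orders", 3, 8),
    DataRequest("q3", "src_b", "lineitem", 2, 4),
]
results = execute_window(window, fake_fetch)
```

In this toy run, three requests reach the sources as two fetches, because `q1` and `q2` share one combined request on `src_a`. A covering range can over-fetch when the original ranges are far apart, so a real mechanism would need a rule for deciding when combining is worthwhile.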
