The state of the art in distributed query processing

Distributed data processing is becoming a reality. Businesses want to do it for many reasons, and they often must do it in order to stay competitive. While much of the infrastructure for distributed data processing is already there (e.g., modern network technology), a number of issues still make distributed data processing a complex undertaking: (1) distributed systems can become very large, involving thousands of heterogeneous sites including PCs and mainframe server machines; (2) the state of a distributed system changes rapidly because the load of sites varies over time and new sites are added to the system; (3) legacy systems need to be integrated; such legacy systems usually have not been designed for distributed data processing and now need to interact with other (modern) systems in a distributed environment. This paper presents the state of the art of query processing for distributed database and information systems. The paper presents the “textbook” architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems. These techniques include special join techniques, techniques to exploit intraquery parallelism, techniques to reduce communication costs, and techniques to exploit caching and replication of data. Furthermore, the paper discusses different kinds of distributed systems such as client-server, middleware (multitier), and heterogeneous database systems, and shows how query processing works in these systems.
