Active semantic caching to optimize multidimensional data analysis in parallel and distributed environments

In this paper, we present a multi-query optimization framework based on the concept of active semantic caching. The framework permits the identification and transparent reuse of data and computation in the presence of multiple queries (or query batches) that specify user-defined operators and aggregations originating from scientific data-analysis applications. We show how query scheduling techniques, coupled with intelligent cache replacement policies, can further improve query processing performance by leveraging the active semantic caching operators. We also propose a methodology for functionally decomposing complex queries into primitives, which exposes multiple reuse sites to the query optimizer and thereby increases the amount of reuse. The optimization framework and the database system implemented with it are designed to be efficient irrespective of the underlying parallel and/or distributed machine configuration. We present experimental results highlighting the performance improvements obtained by our methods using real scientific data-analysis applications on multiple parallel and distributed processing configurations (e.g., a single symmetric multiprocessor (SMP) machine, a cluster of SMP nodes, and a Grid computing configuration).
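To make the reuse idea concrete, the following is a minimal sketch of a semantic cache over a two-dimensional dataset. It is not the framework described in the paper. It assumes that query results are per-cell grids over rectangular regions and that the only reuse transformation is cropping a cached result that fully covers the new query. All names (`SemanticCache`, `CachedResult`, `compute_cell`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Half-open rectangle (x0, y0, x1, y1) over a 2-D dataset (illustrative assumption).
Box = Tuple[int, int, int, int]


@dataclass
class CachedResult:
    op: str                    # name of the user-defined operator
    box: Box                   # region the cached result covers
    grid: List[List[float]]    # per-cell output, indexed as grid[y - y0][x - x0]


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def crop(entry: CachedResult, box: Box) -> List[List[float]]:
    """Transform a cached result into the answer for a sub-region."""
    x0, y0, x1, y1 = box
    ex0, ey0 = entry.box[0], entry.box[1]
    return [row[x0 - ex0:x1 - ex0] for row in entry.grid[y0 - ey0:y1 - ey0]]


class SemanticCache:
    """Hypothetical cache keyed by query semantics (operator + region)."""

    def __init__(self) -> None:
        self.entries: List[CachedResult] = []
        self.computed_cells = 0  # counts cells computed from raw data

    def query(self, op: str, box: Box,
              compute_cell: Callable[[int, int], float]) -> List[List[float]]:
        # Reuse: look for a cached result of the same operator covering the region.
        for entry in self.entries:
            if entry.op == op and contains(entry.box, box):
                return crop(entry, box)
        # Miss: compute every cell from the raw data and cache the result.
        x0, y0, x1, y1 = box
        grid = [[compute_cell(x, y) for x in range(x0, x1)]
                for y in range(y0, y1)]
        self.computed_cells += (x1 - x0) * (y1 - y0)
        self.entries.append(CachedResult(op, box, grid))
        return grid
```

In this sketch, a query for region `(1, 1, 3, 3)` issued after a query for `(0, 0, 4, 4)` with the same operator is answered by cropping the cached grid, so no cell is recomputed. Partial overlaps, cache replacement, and query scheduling are deliberately left out.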
