Estimating local cost parameters for global query optimization in a multidatabase system

To meet users' growing need to access pre-existing heterogeneous databases, a multidatabase system (MDBS), which integrates multiple databases, has recently attracted considerable research attention. A key feature of an MDBS is local autonomy. For a query that retrieves data from multiple databases, global query optimization must be performed to achieve good system performance. Global query optimization in an MDBS poses a number of new challenges. A major one is that some local optimization information, such as local cost parameters, may not be available at the global level because of local autonomy, which makes it difficult to find a good decomposition of a global query during query optimization. To tackle this challenge, a new query sampling method is proposed in this dissertation. The idea is to group component queries into homogeneous classes, draw a sample of queries from each class, and use the observed costs of the sample queries to derive a cost formula for each class by multiple regression. The derived formulas can then be used to estimate the cost of a query during query optimization. Relevant issues, such as classification approaches, membership testing algorithms, sampling procedures, and cost model development, are explored. To verify the feasibility of the method, experiments were conducted on three commercial DBMSs. The experimental results demonstrate that the proposed method is quite promising for estimating local cost parameters in an MDBS. Some considerations for implementing the method in an MDBS are given. To further improve the method, several extended and alternative approaches are suggested, such as a plan-explain-based approach, an adaptive approach, and a fuzzy approach.
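
As a rough illustration of the query sampling idea described above, the following Python sketch fits one linear cost formula per query class by ordinary least squares and then uses it to estimate the cost of a new query. It is a minimal sketch under stated assumptions, not the dissertation's actual cost model: the class name `unary_no_index`, the choice of operand and result cardinalities as explanatory variables, the helper names, and all numbers are hypothetical and synthetic.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the sample does not vary enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_cost_formula(samples):
    """Least-squares fit of cost = b0 + b1*x1 + ... + bk*xk.

    samples: list of (explanatory_values, observed_cost) pairs
    gathered by running sample queries of one class.
    """
    rows = [[1.0, *xs] for xs, _ in samples]  # leading 1.0 is the intercept term
    costs = [cost for _, cost in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, costs)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def estimate_cost(coeffs, xs):
    """Apply a fitted cost formula to the explanatory values of a new query."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))


# Synthetic observations for one hypothetical query class.
# Each entry: ((operand cardinality N, result cardinality R), observed cost).
# The costs were generated from 0.5 + 0.002*N + 0.01*R, so they are not measurements.
SAMPLE_OBSERVATIONS = {
    "unary_no_index": [
        ((1000, 100), 3.5),
        ((2000, 50), 5.0),
        ((5000, 400), 14.5),
        ((8000, 1000), 26.5),
        ((10000, 200), 22.5),
    ],
}

# One cost formula per class, derived from that class's sample queries.
COST_FORMULAS = {
    cls: fit_cost_formula(obs) for cls, obs in SAMPLE_OBSERVATIONS.items()
}

# Estimate the cost of a new query that belongs to the class.
print(estimate_cost(COST_FORMULAS["unary_no_index"], (4000, 300)))
```

The normal equations are solved by hand only to keep the sketch free of dependencies; with real, noisy observations a numerically more robust least-squares routine from a numerical library would likely be preferable, and the fit would need to be checked for statistical significance before use.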