Multi-domain Schema Clustering and Hierarchical Mediated Schema Generation

Data integration lets users access multiple data sources through a uniform interface. Yet querying data sources in which many domains coexist remains difficult, even when the sources are clustered into several domains, because users must write a separate query clause for each domain. Previous research has presented various data integration techniques, but nearly all of them either require the schemas of the data sources being integrated to belong to the same domain, or fail to address the fact that some domains may be sub-domains of a higher-level domain, in which case one more abstract query clause on the upper domain can substitute for several less abstract query clauses on the lower domains. In this paper, we propose a graph-based approach for clustering schemas that ultimately exposes a hierarchical mediated schema forest to users, together with a query forwarding mechanism that transforms queries downward along the schema forest. Experimental results demonstrate that our schema clustering algorithm is effective in clustering data sources into hierarchical schemas, that queries on the mediated schemas return answers with good accuracy, and that the cost to users of writing query clauses is reduced without loss of query accuracy.