Querying distributed heterogeneous structured and semi-structured data sources

The continuing growth and widespread popularity of the internet mean that the collection of useful data available for public access is rapidly increasing in both number and size. These data are spread over distributed heterogeneous data sources, such as traditional databases and sources of various forms containing unstructured and semi-structured data. The value of these data sources would in many cases be greatly enhanced if the data they contain could be combined and queried in a uniform manner. The research reported in this dissertation is concerned with querying and integrating a multiplicity of distributed heterogeneous data: structured data residing in relational databases, and semi-structured data held in well-formed XML documents that are either produced by internet applications or coded by hand. In particular, we have addressed the problems of: (1) specifying the mappings between a global schema and the local data sources' schemas, and resolving the heterogeneity which can occur between data models, schemas or schema concepts; and (2) processing queries that are expressed on a global schema into local queries. We have proposed an approach to combine and query the data sources through a mediation layer. This layer is intended to establish and incrementally evolve an XML Metadata Knowledge Base (XMKB), which assists the Query Processor in mediating between user queries posed over the global schema and the queries on the underlying distributed heterogeneous data sources. The layer translates such queries into sub-queries, called local queries, which are appropriate to each local data source. The XMKB is built in a bottom-up fashion by incrementally extracting and merging the metadata of the data sources. It holds information about the data sources (names, types and locations), descriptions of the mappings between the global schema and the participating data source schemas, and the names of functions for handling semantic and structural discrepancies between the representations.
To demonstrate our research, we have designed and implemented a prototype system called SISSD (System to Integrate Structured and Semi-structured Databases). The system automatically creates a GUI tool for meta-users (who do the metadata integration), which they use to describe mappings between the global schema and local data source schemas. These mappings are used to produce the XMKB. SISSD allows the translation of user queries into sub-queries fitting each participating data source, by exploiting the mapping information stored in the XMKB. The major results of the thesis are: (1) an approach that facilitates building structured and semi-structured data integration systems; (2) a method for generating mappings between the paths of a global schema and those of local schemas, and for resolving the conflicts caused by the heterogeneity of the data sources, such as naming, structural, and semantic conflicts which may occur between the schemas; and (3) a method for translating queries in terms of a global schema into sub-queries in terms of local schemas. Hence, the presented approach shows that: (a) mapping of the schemas' paths can only be partially automated, since the logical heterogeneity problems need to be resolved by human judgment based on the application requirements; and (b) querying distributed heterogeneous structured and semi-structured data sources is possible.
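
As a rough illustration of the mediation idea summarised above, the following Python sketch shows how a knowledge base of path mappings could drive the translation of a query on a global schema path into one local query per data source. It is not the SISSD implementation and does not reproduce the actual XMKB format: the `Mapping` class, the `XMKB` dictionary, the `translate` function, and all source, table, path and function names (`hr_db`, `staff_xml`, `monthly_to_annual`, and so on) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Mapping:
    """One hypothetical knowledge-base entry: where a global path lives locally."""
    source: str                      # name of the local data source
    kind: str                        # "relational" or "xml"
    local_path: str                  # "table.column" or an XPath-like path
    transform: Optional[str] = None  # name of a function resolving a discrepancy


# Hypothetical stand-in for the knowledge base: global path -> local mappings.
XMKB: Dict[str, List[Mapping]] = {
    "employee/name": [
        Mapping("hr_db", "relational", "staff.full_name"),
        Mapping("staff_xml", "xml", "/staff/person/name"),
    ],
    "employee/salary": [
        Mapping("hr_db", "relational", "staff.salary"),
        Mapping("staff_xml", "xml", "/staff/person/pay",
                transform="monthly_to_annual"),
    ],
}


def translate(global_path: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (source, local query text, transform name) for each mapped source."""
    subqueries = []
    for m in XMKB.get(global_path, []):
        if m.kind == "relational":
            # Relational sources receive a simple SQL projection.
            table, column = m.local_path.split(".", 1)
            text = f"SELECT {column} FROM {table}"
        else:
            # XML sources receive the mapped path expression unchanged.
            text = m.local_path
        subqueries.append((m.source, text, m.transform))
    return subqueries


if __name__ == "__main__":
    for source, query, transform in translate("employee/salary"):
        print(source, "|", query, "|", transform)
```

In this sketch the transform is carried only as a name, mirroring the abstract's statement that the XMKB stores function names for handling discrepancies; applying such a function to the results of a local query is left out.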
