With XML fast becoming the de facto standard for representing structured metadata in databases and internet applications, there is a growing need for efficient mechanisms to search such repositories across several application domains. In this poster, we outline the requirements of such a search engine and lay the theoretical foundations for a fast and efficient search mechanism for XML (and other) repositories. We formulate the problem of finding matching schemas as the problem of computing a maximum matching in the bipartite graph formed, for each query-repository schema pair, from the attributes of the two schemas. The edges of the bipartite graph capture the similarity between corresponding attributes in the two schemas. To ensure meaningful matches, we use both name and type semantics in modeling attribute similarity. Since detailed graph matching is compute-intensive, our approach uses upper and lower bounds on the size of the matching to prune candidate schemas. Finally, we develop an indexing technique called attribute hashing for fast indexing of database schemas. The matching schemas in the database are then found by indexing the hash table with the query attributes, computing lower bounds on the maximum matching, and recording peaks in the resulting histogram of hits. The key rationale is that related schemas in the database have an overwhelming number of attributes that are semantically related to the query attributes, so that indexing based on query attributes points only to relevant matching schemas.
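The pipeline described above (hash the repository attributes, index with the query attributes, bound the matching size, and prune) can be illustrated with a minimal Python sketch. It is not the implementation behind this poster. As a simplifying assumption, attribute similarity is reduced to equality of the lowercased name and type, whereas the approach described here models name and type semantics more richly. All function names (`build_index`, `candidate_edges`, `matching_bounds`, `maximum_matching`, `search`) are hypothetical, the lower bound is a plain greedy matching, and the exact step uses a standard augmenting-path algorithm as a stand-in.

```python
from collections import defaultdict


def normalize(name, type_):
    # Assumption: similarity is approximated by equality of lowercased name and type.
    return (name.strip().lower(), type_.strip().lower())


def build_index(repository):
    """Attribute hashing: map each attribute key to the (schema, position) pairs using it.

    repository: dict of schema_id -> list of (name, type) attributes.
    """
    index = defaultdict(set)
    for schema_id, attrs in repository.items():
        for pos, (name, type_) in enumerate(attrs):
            index[normalize(name, type_)].add((schema_id, pos))
    return index


def candidate_edges(index, query):
    """Look up every query attribute; return schema_id -> {query pos: set of repo pos}."""
    edges = defaultdict(lambda: defaultdict(set))
    for q_pos, (name, type_) in enumerate(query):
        for schema_id, r_pos in index.get(normalize(name, type_), ()):
            edges[schema_id][q_pos].add(r_pos)
    return edges


def matching_bounds(adj):
    """Lower bound: size of a greedy matching. Upper bound: smaller side of the graph."""
    used, lower = set(), 0
    for q_pos in sorted(adj):
        for r_pos in sorted(adj[q_pos]):
            if r_pos not in used:
                used.add(r_pos)
                lower += 1
                break
    right = set().union(*adj.values()) if adj else set()
    return lower, min(len(adj), len(right))


def maximum_matching(adj):
    """Exact maximum matching size via augmenting paths."""
    match_of = {}  # repository attribute position -> matched query attribute position

    def try_augment(q, seen):
        for r in adj[q]:
            if r in seen:
                continue
            seen.add(r)
            if r not in match_of or try_augment(match_of[r], seen):
                match_of[r] = q
                return True
        return False

    return sum(try_augment(q, set()) for q in adj)


def search(index, query):
    """Rank the schemas that could still be the best match after bound-based pruning."""
    edges = candidate_edges(index, query)
    bounds = {sid: matching_bounds(adj) for sid, adj in edges.items()}
    best_lower = max((lo for lo, _ in bounds.values()), default=0)
    results = []
    for sid, (lo, hi) in bounds.items():
        if hi < best_lower:
            continue  # upper bound cannot reach the best lower bound: prune
        size = lo if lo == hi else maximum_matching(edges[sid])
        results.append((sid, size))
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

In this sketch, the number of hits per schema plays the role of the histogram, and the exact matching is computed only for schemas whose bounds do not already agree. For a repository containing an `orders` schema with attributes `order_id:int`, `customer:string`, and `total:decimal`, a query with the same three attributes would rank `orders` first with a matching of size 3, while schemas sharing only two attributes would be pruned by their upper bound.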