Schema extraction and levelization for XML data

XML is a new standard for representing and exchanging information on the Internet. XML data is data tagged with XML elements, so it can be retrieved by more than Boolean combinations of keywords. Keyword-based information retrieval often fails to return what users want, partly because keywords cannot properly convey a request: it produces either too many or too few matches, and it is not trivial to formulate a query whose result has a useful size. In conventional databases, a schema helps users formulate queries and helps the system process them. Likewise, this paper proposes a method of schema extraction for XML data collections. A single schema, however, is not sufficient to yield well-sized results or to adapt to the varied requests of Internet users. Extracted schemas are therefore levelized with respect to the frequency of topological data structures in the database. The topological structural information in these schemas is used to formulate queries and then to rewrite them for relaxation and restriction. The proposed method applies without modification both to multimedia XML data collections and to general XML databases.
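The abstract does not give the extraction or levelization procedure, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a "topological structure" can be approximated by a root-to-element tag path, that frequency means the fraction of documents containing that path, and that levels are defined by frequency thresholds. The function names `label_paths` and `levelize` and the threshold values are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def label_paths(xml_text):
    """Return the set of distinct root-to-element tag paths in one document.

    Assumption: a tag path stands in for a "topological structure".
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def levelize(documents, thresholds=(1.0, 0.5, 0.0)):
    """Build one path set per threshold, from strictest to most relaxed.

    A path belongs to a level if the fraction of documents containing it
    is at least that level's threshold (hypothetical levelization rule).
    """
    counts = Counter()
    for doc in documents:
        counts.update(label_paths(doc))
    total = len(documents)
    return [
        {path for path, count in counts.items() if count / total >= threshold}
        for threshold in thresholds
    ]


if __name__ == "__main__":
    docs = [
        "<movie><title>A</title><year>1999</year></movie>",
        "<movie><title>B</title><cast><actor>X</actor></cast></movie>",
    ]
    for level, paths in enumerate(levelize(docs)):
        print(level, sorted("/".join(p) for p in paths))
```

In this sketch the first level keeps only structures shared by every document, and later levels admit rarer ones. A query written against a stricter level could be relaxed by moving to a later level, and restricted by moving back, which is one possible reading of the rewriting the abstract describes.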
