Similarity-based Queries for XML Databases Using ELIXIR

Several languages for querying and transforming XML have been proposed, including XML-QL, Quilt, and XQL. However, these languages do not support ranked queries based on textual similarity, in the spirit of traditional IR. Keyword-search extensions to these languages exist, but the resulting languages cannot express IR-style queries such as "find books and CDs with similar titles." Some of them use keywords merely as boolean filters, without support for true ranked retrieval; others permit similarity calculations only between a data value and a constant, and thus cannot express the above query. WHIRL avoids both problems, but assumes relational data. We propose ELIXIR, an expressive and efficient language for XML information retrieval that extends XML-QL with a textual similarity operator. Because this operator can be used for similarity joins, ELIXIR is sufficiently expressive to handle the sample query above, and thus qualifies as a general-purpose XML IR query language. Our central contribution is an efficient algorithm for answering ELIXIR queries: it rewrites the original ELIXIR query into a series of XML-QL queries that generate intermediate relational data, then uses WHIRL to efficiently evaluate the similarity operators on this intermediate data, yielding an XML document with nodes ranked by similarity. Our experiments demonstrate that our prototype scales well with the size of the query and the XML data.
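
To make the sample query concrete, the following is a minimal Python sketch of a similarity join over "find books and CDs with similar titles." It is not ELIXIR syntax and not the WHIRL implementation: it assumes a hypothetical catalog layout (the `book`, `cd`, and `title` element names and the sample data are invented for illustration), flattens the XML into two lists of titles as a stand-in for the intermediate relational data, and ranks pairs with a naive all-pairs TF-IDF cosine similarity. The particular weighting formula is also an assumption.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data; the element names are assumptions for illustration.
XML = """
<catalog>
  <book><title>The Sound of Music</title></book>
  <book><title>Database Systems</title></book>
  <cd><title>Sound of Music Soundtrack</title></cd>
  <cd><title>Greatest Piano Hits</title></cd>
</catalog>
"""


def extract_titles(root, tag):
    """Flatten the XML into a plain list of titles (a one-column relation)."""
    return [el.findtext("title", default="") for el in root.iter(tag)]


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors, one sparse dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: log-scaled term frequency times smoothed IDF.
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit vectors, iterating over the smaller one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, top_k=3):
    """Naive all-pairs join: score every (left, right) pair, keep the best."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r) for l, a in zip(left, lv) for r, b in zip(right, rv)]
    pairs = [p for p in pairs if p[0] > 0]  # drop pairs with no shared terms
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_k]


root = ET.fromstring(XML)
books = extract_titles(root, "book")
cds = extract_titles(root, "cd")
ranked = similarity_join(books, cds)
for score, book, cd in ranked:
    print(f"{score:.3f}  {book!r} ~ {cd!r}")
```

The sketch scores every pair, which costs time proportional to the product of the two list sizes; it illustrates only what a ranked similarity join returns, not how to evaluate one efficiently.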
