The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking

Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines are based on the ranked retrieval paradigm, but they do not consider the additional information and rich annotations provided by the structure of XML documents and their element names. This paper presents the XXL search engine, which supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match and semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve search efficiency and effectiveness. XXL is fully implemented as a suite of Java servlets. Experiments with a variety of structurally diverse XML data demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.
