The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents

This paper presents the novel SphereSearch Engine, which provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure conditions, text content conditions, and relevance ranking based on IR statistics and statistically quantified ontological relationships. Web pages in HTML or PDF are automatically converted into XML, with the option of generating semantic tags by means of linguistic annotation tools. For Web data, the XML-oriented query engine is leveraged to provide rich search options that cannot be expressed in traditional Web search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch Engine are demonstrated by experiments with a large, richly tagged but non-schematic open encyclopedia extended with external documents.
