SIRIUS: A Lightweight XML Indexing and Approximate Search System at INEX 2005

This paper reports on SIRIUS, a lightweight indexing and search engine for XML documents. The retrieval approach is document oriented and relies on approximate matching of both structure and textual content. Instead of matching whole DOM trees, SIRIUS splits the document object model into a set of paths. In this view, a request is a path-like expression with conditions on attribute values. We first present the main functionalities and characteristics of this XML IR system, and then report on our experience adapting and using it for the INEX 2005 ad-hoc retrieval task. Finally, we present and analyze the retrieval performance SIRIUS obtained during the INEX 2005 evaluation campaign, and show that, despite its lightweight design, it retrieved highly relevant non-overlapping XML elements and achieved quite good precision at low recall values.
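
The path-based view described above can be illustrated with a minimal Python sketch. It is not the SIRIUS implementation: the function names (`root_to_node_paths`, `edit_distance`, `rank_paths`) are hypothetical, and the use of a unit-cost Levenshtein distance over tag sequences, combined with a plain keyword test on element text, is an assumption made for illustration only. The abstract itself does not specify the matching costs or the form of the conditions.

```python
import xml.etree.ElementTree as ET


def root_to_node_paths(xml_text):
    """Split a document into (tag_sequence, text) pairs, one per element."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        tags = prefix + [node.tag]
        paths.append((tags, (node.text or "").strip()))
        for child in node:
            walk(child, tags)

    walk(root, [])
    return paths


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two tag sequences (assumed costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def rank_paths(xml_text, query_tags, keyword):
    """Rank elements whose text contains the keyword by structural distance."""
    hits = []
    for tags, text in root_to_node_paths(xml_text):
        if keyword.lower() in text.lower():
            hits.append((edit_distance(query_tags, tags), "/".join(tags)))
    return sorted(hits)
```

For example, calling `rank_paths(doc, ["article", "sec", "p"], "xml")` on a small document ranks an element reached by `article/bdy/sec/p` ahead of one reached by `article/fm/abs`, because the first path needs a single insertion to match the request while the second needs two substitutions.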
