SIRIUS XML IR System at INEX 2006: Approximate Matching of Structure and Textual Content

In this paper we report on the retrieval approach taken by the VALORIA laboratory of the University of South-Brittany in the INEX 2006 ad-hoc track with the SIRIUS XML IR system. SIRIUS retrieves relevant XML elements by approximately matching both the content and the structure of XML documents. A weighted edit distance on XML paths is used to approximately match document structure, while the IDF of the query terms is used to rank the textual content of the retrieved elements. We briefly describe the approach and the extensions made to SIRIUS to address each of the four subtasks of the INEX 2006 ad-hoc track. Finally, we present and analyze the SIRIUS retrieval evaluation results. SIRIUS runs ranked first out of 77 submitted runs for the Best In Context task and obtained several top-ten results for both the Focused and All In Context tasks.
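To make the two matching components concrete, here is a minimal sketch in Python. It is not the SIRIUS implementation: the function names, the uniform default costs, and the plain logarithmic IDF formula are assumptions chosen for illustration, and the weighting scheme actually used by the system may differ. The sketch computes a weighted edit distance between two XML paths treated as sequences of tag names (a standard dynamic-programming formulation) and an IDF weight for a query term.

```python
import math


def weighted_path_distance(query_path, doc_path, ins=1.0, dele=1.0, sub=1.0):
    """Weighted edit distance between two XML paths, compared tag by tag.

    The costs (ins, dele, sub) are illustrative assumptions, not the
    weights used by SIRIUS.
    """
    q = [step for step in query_path.split("/") if step]
    t = [step for step in doc_path.split("/") if step]

    # d[i][j] = cheapest way to turn the first i query tags into the first j document tags
    d = [[0.0] * (len(t) + 1) for _ in range(len(q) + 1)]
    for i in range(1, len(q) + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + ins

    for i in range(1, len(q) + 1):
        for j in range(1, len(t) + 1):
            match_cost = 0.0 if q[i - 1] == t[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + dele,            # drop a query tag
                d[i][j - 1] + ins,             # skip an extra document tag
                d[i - 1][j - 1] + match_cost,  # match or substitute
            )
    return d[len(q)][len(t)]


def idf(term, doc_freq, n_docs):
    """Plain logarithmic IDF; returns 0.0 for terms absent from the collection."""
    df = doc_freq.get(term, 0)
    return math.log(n_docs / df) if df else 0.0
```

For example, with a cheap insertion cost, the query path `/article/sec/p` stays close to the document path `/article/body/sec/p`, since only the extra `body` tag has to be skipped. How the structural distance and the IDF-based content score are combined into a final ranking is not specified here and is left out of the sketch.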
