XML Information Retrieval through Tree Edit Distance and Structural Summaries

Semi-structured Information Retrieval (SIR) allows users to narrow their search down to the element level. Since queries and XML documents can both be seen as hierarchically nested elements, we consider that their structural proximity can be evaluated through the similarity of their trees. Our approach combines a content score and a structure score, the latter being based on tree edit distance (the minimal cost of the operations needed to turn one tree into another). We use the tree structure to propagate and combine both measures. Moreover, to reduce time and space complexity, we summarize the document tree structure. We experimented with various tree summary techniques, as well as with our original model, on the SSCAS task of the INEX 2005 campaign. Results showed that our approach outperforms state-of-the-art approaches.
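To make the notion of tree edit distance concrete, the following is a minimal sketch of the textbook forest recursion for ordered labeled trees. It is not the algorithm used in the paper, which the abstract does not specify. The nested-tuple tree representation, the unit cost of every operation, and the names `forest_size`, `forest_distance`, and `tree_edit_distance` are all assumptions made for illustration.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees; a forest is a tuple of trees. Tuples are hashable, so
# intermediate results can be memoized.


def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests, assuming unit costs."""
    if not f:
        return forest_size(g)  # insert every node of g
    if not g:
        return forest_size(f)  # delete every node of f

    v_label, v_children = f[-1]  # rightmost root of f
    w_label, w_children = g[-1]  # rightmost root of g

    # Delete v: its children take its place in the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert w: symmetric to deletion.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map v onto w (a relabeling if the labels differ), then solve the
    # children and the remaining siblings independently.
    relabel = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, relabel)


def tree_edit_distance(t1, t2):
    """Edit distance between two trees, each treated as a one-tree forest."""
    return forest_distance((t1,), (t2,))


# Hypothetical example: a structural query article/sec/p against a document
# path article/body/sec/p. One deletion ("body") aligns the two trees.
query = ("article", (("sec", (("p", ()),)),))
doc = ("article", (("body", (("sec", (("p", ()),)),)),))
print(tree_edit_distance(query, doc))
```

This naive recursion is only practical for small trees. Its cost grows quickly with tree size, which is consistent with the motivation given above for summarizing the document tree before comparing it with the query.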
