Effective XML content and structure retrieval with relevance ranking

XML documents can be retrieved not only with content-only (CO) queries but also with content-and-structure (CAS) queries. Although CAS queries promise better retrieval precision, they introduce several new challenges. To address these challenges, we propose a novel approach to XML CAS retrieval whose distinctive feature is its content-oriented point of view. Specifically, the approach first decomposes a CAS query into several fragments, then retrieves results for each query fragment in a content-centric way, and finally scores each answer node. The approach adapts to both homogeneous and heterogeneous data environments. To assess the relevance of retrieval results to a query fragment, we present a scoring strategy that measures relevance from both the content and the structure perspectives. In addition, we propose an effective method for inferring answer nodes from the CAS query and the document structure, and we present an efficient algorithm for CAS retrieval. Finally, we demonstrate the effectiveness of the proposed methods through comprehensive experimental studies.
