Paper information - Structured Text Retrieval Models

Structured Text Retrieval Models

Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself; the structure divides the database into separate textual parts and relates those parts by some criterion. Textual databases can often be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content. A retrieval model for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like “containing” and “contained-by”, to model relations between content and structure as well as relations between the structural elements themselves. Using such a query language, the (expert) user can, for instance, formulate requests like “I want a paragraph discussing formal models near a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and expressiveness of the query languages of several models for structured text retrieval are discussed below.
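The three parts described above can be illustrated with a minimal Python sketch. It is not taken from any of the models cited here; every name in it (`Region`, `tokenize`, `parse`, `select`, `word`, `containing`, `contained_by`, the stop-word list, and the toy markup) is a hypothetical choice made for illustration. It assumes well-formed, attribute-free tags and represents each region as an inclusive range of token positions.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    name: str   # structural label, e.g. "paragraph", or "word" for a term hit
    start: int  # position of the first token in the region (inclusive)
    end: int    # position of the last token in the region (inclusive)


# Assumed stop-word list; a real text model would also handle stemming, synonyms, etc.
STOP_WORDS = {"a", "the", "and", "of", "on"}


def tokenize(text):
    """Text model: lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def parse(marked_up):
    """Structure model: map simple <tag>...</tag> markup to tokens plus regions."""
    tokens, regions, open_tags = [], [], []
    for piece in re.split(r"(</?\w+>)", marked_up):
        if re.fullmatch(r"<\w+>", piece):        # opening tag: remember where it starts
            open_tags.append((piece[1:-1], len(tokens)))
        elif re.fullmatch(r"</\w+>", piece):     # closing tag: emit the finished region
            name, start = open_tags.pop()
            regions.append(Region(name, start, len(tokens) - 1))
        else:                                    # plain text between tags
            tokens.extend(tokenize(piece))
    return tokens, regions


# Query language: operators that take and return sets of regions.
def select(regions, name):
    """Structural operator: all regions with the given label."""
    return {r for r in regions if r.name == name}


def word(tokens, term):
    """Content operator: one-token regions where `term` occurs."""
    return {Region("word", i, i) for i, t in enumerate(tokens) if t == term}


def containing(outer, inner):
    """Regions of `outer` that contain at least one region of `inner`."""
    return {o for o in outer
            if any(o.start <= i.start and i.end <= o.end for i in inner)}


def contained_by(inner, outer):
    """Regions of `inner` that lie inside at least one region of `outer`."""
    return {i for i in inner
            if any(o.start <= i.start and i.end <= o.end for o in outer)}


doc = ("<doc><paragraph>Formal models of retrieval</paragraph>"
       "<table>Databases and information retrieval</table></doc>")
tokens, regions = parse(doc)

# "paragraphs containing the word 'formal'"
hits = containing(select(regions, "paragraph"), word(tokens, "formal"))
# "occurrences of 'retrieval' contained by a table"
in_table = contained_by(word(tokens, "retrieval"), select(regions, "table"))
```

Because every operator returns a plain set of regions, the set operators mentioned above come for free from Python's `|`, `&`, and `-`, and operators can be nested to build larger queries. A proximity operator such as "near" is left out of this sketch.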

Djoerd Hiemstra | Ricardo Baeza-Yates

[1] W. Alink. XIRAF: an XML Information Retrieval Approach to Digital Forensics, 2005.

[2] Forbes J. Burkowski. Retrieval activities in a database consisting of heterogeneous collections of structured text, 1992, SIGIR '92.

[3] David Carmel, et al. Searching XML documents via XML fragments, 2003, SIGIR.

[4] Jani Jaakkola. Nested text-region algebra, 1999.

[5] Laks V. S. Lakshmanan, et al. FleXPath: flexible structure and full-text querying for XML, 2004, SIGMOD '04.

[6] Gaston H. Gonnet, et al. Mind Your Grammar: a New Approach to Modelling Text, 1987, VLDB.

[7] Ricardo A. Baeza-Yates, et al. Proximal nodes: a model to query document databases by content and structure, 1997, TOIS.

[8] Norbert Fuhr, et al. XIRQL: a query language for information retrieval in XML documents, 2001, SIGIR '01.

[9] Sihem Amer-Yahia, et al. TeXQuery: a full-text search extension to XQuery, 2004, WWW '04.

[10] Charles L. A. Clarke, et al. An Algebra for Structured Text Search and a Framework for its Implementation, 1995, Comput. J.

[11] Airi Salminen. PAT expressions: an algebra for text search, 2007.

[12] James P. Callan, et al. Hierarchical Language Models for XML Component Retrieval, 2004, INEX.

[13] Ricardo A. Baeza-Yates, et al. Integrating contents and structure in text retrieval, 1996, SGMD.

[14] Djoerd Hiemstra, et al. Score region algebra: building a transparent XML-R database, 2005, CIKM '05.