Structured Document Retrieval

DEFINITION

Structured document retrieval is concerned with the retrieval of document fragments. The structure of the document, whether explicitly provided by a markup language or derived, is exploited to determine the most relevant document fragments to return as answers to a given query. The most relevant fragments identified in this way can in turn be used to determine the most relevant documents to return as answers to the query.

MAIN TEXT

The aim of this entry is to clarify the different terms that have been used to refer to, or are strongly related to, structured retrieval and semi-structured data. The term "structured document retrieval", which was introduced in the information retrieval community in the early to mid 1990s, refers to both "passage retrieval" and "structured text retrieval". In passage retrieval, documents are first decomposed into passages (e.g. fixed-size windows of words, fixed discourse units such as paragraphs, or topic segments produced by a topic segmentation algorithm). Passages can themselves be retrieved as answers to a query, or be used to rank documents as answers to the query.

Structured text retrieval is concerned with the development of models for querying and retrieving from structured text, where the structure is usually encoded with a markup language such as SGML or, now predominantly, XML. Indeed, text documents often display structural information. For example, a scientific article has a so-called logical structure, such as an abstract and several sections and subsections, each of which is composed of paragraphs. A book has a so-called layout structure, such as pages and columns. Structured text retrieval contrasts with traditional text retrieval, which is concerned with the retrieval of unstructured text, so-called "raw text" or "flat text".
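
The passage retrieval idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a method taken from this entry: the function names (split_into_passages, score, rank_documents), the window and stride parameters, and the toy term-count scoring are all hypothetical choices. Documents are decomposed into fixed-size windows of words, each passage is scored against the query, and each document is ranked by its best-scoring passage.

```python
import re
from collections import Counter


def split_into_passages(text, window=50, stride=25):
    """Split text into overlapping fixed-size windows of words."""
    words = text.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + window]))
        # Stop once a window has reached the end of the document.
        if start + window >= len(words):
            break
    return passages


def score(query, passage):
    """Toy relevance score (an assumption): occurrences of query terms."""
    term_counts = Counter(re.findall(r"\w+", passage.lower()))
    query_terms = set(re.findall(r"\w+", query.lower()))
    return sum(term_counts[term] for term in query_terms)


def rank_documents(query, docs, window=50, stride=25):
    """Rank documents by the score of their best passage.

    docs maps a document id to its text. Each result is a tuple
    (doc_id, best_score, best_passage), so the best passage can be
    returned as an answer in its own right.
    """
    results = []
    for doc_id, text in docs.items():
        passages = split_into_passages(text, window, stride)
        best = max(passages, key=lambda p: score(query, p), default="")
        results.append((doc_id, score(query, best), best))
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "xml retrieval of xml elements",
        "b": "cooking pasta at home",
    }
    for doc_id, best_score, passage in rank_documents(
        "xml retrieval", docs, window=3, stride=2
    ):
        print(doc_id, best_score, repr(passage))
```

Taking the maximum passage score is only one way to derive a document ranking from passages; summing or averaging passage scores would be an equally plausible assumption. The same skeleton applies when the fixed-size windows are replaced by paragraphs or by segments from a topic segmentation algorithm.
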
The term "structured" in "structured text retrieval" emphasizes the interest in the structure. Furthermore, structured text retrieval aims to exploit the available structural information to return text fragments (e.g. XML elements) rather than entire text documents.

The term "semi-structured" comes mainly from the database community. Traditional database technologies, such as relational databases, have been concerned with the querying and retrieval of highly structured data (e.g. from a student table, find the names and addresses of those with a grade over 80 in a particular subject). Text documents marked up, for instance, in XML …