Paper information - Structure Weight

Structure Weight

SYNONYM
None

DEFINITION
In structured text retrieval, the structure of a text component may be used to estimate the relevance of that component. This is done by associating a weight with the structure, reflecting its significance when estimating the relevance of the component for a given query.

MAIN TEXT
Associating a weight with the structure of a component is not new in itself, and several investigations have been reported for whole document retrieval. This entry is concerned with structure weights in the context of structured text retrieval, where the aim is to exploit the document structure to return document components instead of whole documents.

In structured text retrieval, not all document components will trigger the same user satisfaction when returned as answers to queries. For structured documents marked up in XML, some document components, i.e. XML elements, may not be appropriate to return because they are too small, of a tag type that does not contain informative content, nested too deep in the document logical structure, or for other reasons. When ranking XML elements, their structure (size, tag type, path, depth, etc.) may therefore prove important.

The importance of the element structure is captured through a weight, which can be binary. Using binary weights means that an element either is (value one) or is not (value zero) considered for indexing and retrieval. The decision can be made by looking at the DTD¹ of the collection, past relevance data, and/or the requirements of the application and user scenario. In the selective indexing strategy [3], only elements of types that were found to contain relevant content for previous query sets (relevance data) are considered. Elements whose length is below a given threshold can also be ignored.

Weights can also be assigned to characteristics of elements, such as length, depth, location in the document logical structure, and so on.
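The binary weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited system: the `Element` class, the `binary_weight` function, the set of allowed tag types, and the token threshold are all hypothetical names and values chosen for the example. In practice the allowed types would come from the DTD or from past relevance data.

```python
from dataclasses import dataclass


@dataclass
class Element:
    tag: str   # XML tag type, e.g. "sec" or "p"
    text: str  # textual content of the element


def binary_weight(element, allowed_tags, min_tokens):
    """Return 1 if the element is considered for indexing, else 0."""
    # Tag types not found useful (assumed known in advance) get weight 0.
    if element.tag not in allowed_tags:
        return 0
    # Elements shorter than the threshold (in whitespace tokens) get weight 0.
    if len(element.text.split()) < min_tokens:
        return 0
    return 1


# Hypothetical settings, for illustration only.
ALLOWED_TAGS = {"sec", "p"}
MIN_TOKENS = 3

elements = [
    Element("sec", "structure weights in XML retrieval"),  # allowed tag, long enough
    Element("it", "an emphasised phrase inside a paragraph"),  # tag type not allowed
    Element("p", "too short"),  # allowed tag, below the length threshold
]

weights = [binary_weight(e, ALLOWED_TAGS, MIN_TOKENS) for e in elements]
indexed = [e for e, w in zip(elements, weights) if w == 1]
```

Only the first element survives the filter here. Because the weight is binary, it acts as a selection step before ranking rather than as a factor in the ranking score.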
As one example of weighting an element characteristic, length has been used within the language modelling framework as a normalization parameter (weight), incorporated through a prior probability in the ranking formula [2]. With statistical approaches, the weights are estimated from training data, such as past relevance data. The weights can be determined using machine learning and then used in the ranking function. They can also be directly calculated based on the ...

¹ Document Type Definition.
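The use of length as a prior in a language modelling ranking formula can be illustrated with a short Python sketch. It assumes a unigram query-likelihood model with linear (Jelinek-Mercer) smoothing and a prior proportional to element length. The smoothing method, the function name `score_element`, its parameters, and the toy data are assumptions made for this example, not details taken from the cited work.

```python
import math
from collections import Counter


def score_element(query_terms, element_tokens, collection_tf, collection_len, lam=0.5):
    """Log-score of an element: log length prior + log query likelihood."""
    length = len(element_tokens)
    if length == 0:
        return float("-inf")  # an empty element cannot be scored
    tf = Counter(element_tokens)
    # Length prior (assumption): P(e) proportional to the element length.
    log_prior = math.log(length / collection_len)
    log_likelihood = 0.0
    for term in query_terms:
        p_element = tf[term] / length
        p_collection = collection_tf.get(term, 0) / collection_len
        # Linear interpolation of element and collection models.
        p = lam * p_element + (1 - lam) * p_collection
        if p == 0:
            return float("-inf")  # term unseen anywhere in the collection
        log_likelihood += math.log(p)
    return log_prior + log_likelihood


# Toy data: two elements with identical term distributions but different lengths.
short_el = ["xml", "retrieval"]
long_el = ["xml", "retrieval"] * 5
collection_tf = Counter(short_el + long_el)
collection_len = len(short_el) + len(long_el)

s_short = score_element(["xml"], short_el, collection_tf, collection_len)
s_long = score_element(["xml"], long_el, collection_tf, collection_len)
```

Both elements have the same query likelihood in this toy setting, so the difference between `s_long` and `s_short` comes entirely from the prior, which favours the longer element.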

Mounia Lalmas

[1] Maarten de Rijke, et al. Length normalization in XML retrieval, 2004, SIGIR '04.

[2] Yosi Mass, et al. Component Ranking and Automatic Query Refinement for XML Retrieval, 2004, INEX.