Contextualization from the Bibliographic Structure

Relevance scoring and estimation deal with both finding the relevant set of answers and ordering them by their degree of relevance to the user's intent. Traditional information retrieval (IR) systems find and order the relevant documents and leave them to the users, who must then locate the relevant information embedded somewhere within each document. In contrast, estimating relevance in semi-structured retrieval means not only retrieving and ordering the relevant documents but also locating the relevant information within them. For semi-structured retrieval, traditional IR-style retrieval is therefore insufficient. The main focus of this thesis is estimating relevance in a schema-agnostic environment. Here, “schema-agnostic” means that the schema or structure exists explicitly within the documents, but the user does not, or need not, know that schema. In such an environment, the structure is generally loosely defined, which means that (a) it can evolve over time, (b) it can constitute a large part of the data, and (c) it might exist seamlessly within the document. The natural question that comes to mind is: why is such a structure there at all? The structure in a schema-agnostic environment is there to be used by retrieval systems for several useful tasks. This thesis is about unveiling the capabilities of the structural constructs within semi-structured documents in schema-agnostic settings. Structural constructs can form what we call the structural context of a relevant item. A structural context comprises the internal and external contextual features of a semi-structured document, and these contextual features help with a series of tasks. The work presented in this thesis contributes towards understanding and utilizing contextual features in the retrieval of focused information in schema-agnostic settings.
During the course of this study, we identified, implemented, and experimented with several intuitive types of contextual features in semi-structured retrieval settings. Contextualization is the generic process of utilizing features in the structural context of retrievable units in relevance scoring. In empirical analyses, the proposed retrieval approaches, based mainly on contextual features, exhibited notable improvements in retrieval effectiveness. The evaluations and empirical analyses were performed on several tasks spread across different phases of this study, each addressing different aspects and challenges of the semi-structured retrieval domain: ad-hoc tasks, granulation tasks, and standard tasks offered by the INitiative for the Evaluation of XML retrieval (INEX). The contributions of this thesis are also grouped by these tasks.