Distributed processing of queries for XML documents in an agent based information retrieval system

This paper addresses the problem of efficiently querying large numbers of text documents using parallel processing methods. The optimization criteria differ from those used in querying heterogeneous databases, largely because extracting ontological information from documents is the dominant component of query execution time. We assume that each document has previously been annotated using XML. We describe the architecture of a system that processes ontology-based queries over XML-annotated documents. We introduce two basic query processing strategies, a simple strategy and a semi-join strategy, together with possible extensions that use pipelining and longer lists for keyword search. We discuss different levels of parallelism for these strategies. We then build an evaluation model and use it to derive the optimal replication of resource agents. Finally, we compare the theoretical and experimental results.