Distributed IR for Digital Libraries

This paper examines technology developed to support large-scale distributed digital libraries. We describe the method used for harvesting collection information using standard information retrieval protocols and how this information is used in collection ranking and retrieval. The system that we have developed takes a probabilistic approach to distributed information retrieval using a Logistic regression algorithm for estimation of distributed collection relevance and fusion techniques to combine multiple sources of evidence. We discuss the harvesting method used and how it can be employed in building collection representatives using features of the Z39.50 protocol. The extracted collection representatives are ranked using a fusion of probabilistic retrieval methods. The effectiveness of our algorithm is compared to other distributed search methods using test collections developed for distributed search evaluation. We also describe how this system in currently being applied to operational systems in the U.K.