Ensuring Retrieval Effectiveness in Distributed Digital Libraries

Abstract. We find that a distributed collection of documents must disseminate collection-wide information (CWI) among its sites to achieve retrieval effectiveness comparable to that of a centralized collection. Complete dissemination is unnecessary. The required level of dissemination depends on the content skew of the distributed collection, i.e., how documents are allocated among sites. A low level of dissemination suffices when documents are allocated randomly, but higher levels are needed when documents are allocated on the basis of their content. We define parameters to control dissemination and document allocation and present results from four document collections. These results provide insight into the technology that digital libraries require. We also describe the architecture of the Networked Computer Science Technical Report Library (NCSTRL), a concrete example of a system that fits our model of a distributed archive.
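The effect of content skew can be illustrated with a toy sketch. It assumes, purely for illustration, that the collection-wide information is the document frequency used in an inverse document frequency (idf) term weight; the abstract does not name the statistic, and the smoothed idf formula, the eight-document collection, and the names `idf`, `skewed`, and `mixed` are all hypothetical. The sketch compares the idf a site would compute from its own documents alone (no dissemination) with the idf of the centralized collection.

```python
import math

def idf(docs, term):
    """Smoothed inverse document frequency of `term` over `docs`.

    Each document is a set of terms. The formula is an assumption chosen
    for illustration, not one taken from the paper.
    """
    df = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (df + 1))

# Hypothetical collection: 4 documents on "retrieval", 4 on "network".
docs = [{"retrieval", "index"}] * 4 + [{"network", "index"}] * 4

# Content-based allocation: each site holds a single topic.
skewed = [docs[:4], docs[4:]]
# Interleaved allocation, a deterministic stand-in for random allocation.
mixed = [docs[0::2], docs[1::2]]

term = "retrieval"
central = idf(docs, term)

def worst_deviation(sites):
    """Largest gap between a site's local idf and the centralized idf."""
    return max(abs(idf(site, term) - central) for site in sites)

skewed_gap = worst_deviation(skewed)
mixed_gap = worst_deviation(mixed)

print(f"centralized idf:      {central:.3f}")
print(f"worst gap, by topic:  {skewed_gap:.3f}")
print(f"worst gap, mixed:     {mixed_gap:.3f}")
```

In this toy setting the mixed sites already approximate the centralized weight from local statistics, while the topic-based sites do not, which is consistent with the claim that content-based allocation calls for more dissemination.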
