Maintaining retrieval effectiveness in distributed, dynamic information retrieval systems

We present an empirical study of how retrieval effectiveness changes when a search uses collection statistics derived from only a subset of the collection. We give a generic model for searching a document collection that allows for such subset-derived statistics. Within this model, we identify two realistic scenarios that require them: searching distributed document databases, and ad hoc search in dynamic document databases.

We view the distributed document archive as a set of collections, each of which knows about some fraction of the other collections in the archive. We build these collections empirically by taking documents from standard IR test collections and assigning them parametrically to collections in the system. Content-skew is the degree to which the holdings at a particular site differ from those at another site or at a globally defined "central" site. Our results show that content-skew has a pronounced negative effect on retrieval effectiveness. Highly skewed document collections require more knowledge about the global collection than content-uniform ones. However, even in highly skewed systems, sites can know about a relatively small fraction of the holdings at other sites without pronounced degradation in search quality.

We model the dynamic document archive as two collections: an "old" collection with complete statistics available, and a "new" collection composed of recently inserted documents that have not yet been incorporated into the document index and collection statistics. Our results show that retrieval effectiveness is maintained for "new" collections of realistic size when statistics from the "old" collection are used. The only problematic situation arises when the "new" collection introduces terms that the "old" collection does not contain.

We also give two methods for measuring content-skew directly, one based on topic identification and the other based on the well-known inverse document frequency statistic. We use one or both of these methods to measure the content-skew of three kinds of document archives: our empirically defined collections, the TREC collection, and the Networked Computer Science Technical Report Library (NCSTRL), an operational distributed archive.
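
The dynamic-archive scenario can be illustrated with a minimal sketch. It assumes a simple logarithmic inverse document frequency weight, which the abstract does not specify; the function names and the toy documents are hypothetical and serve only to show how terms in a "new" document are weighted with statistics from the "old" collection, and where that breaks down.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df

def idf(term, df, n_docs):
    """Assumed idf form: log((N + 1) / df). Returns None for unseen terms."""
    if term not in df:
        return None  # the statistics say nothing about this term
    return math.log((n_docs + 1) / df[term])

# Toy "old" collection with complete statistics (illustrative data only).
old_docs = [
    "retrieval of text documents",
    "distributed text archives",
    "index maintenance for text",
]
# Toy "new" document that has not yet been folded into the statistics.
new_doc = "hypertext retrieval"

old_df = document_frequencies(old_docs)
weights = {t: idf(t, old_df, len(old_docs)) for t in new_doc.split()}

for term, weight in weights.items():
    print(term, weight)
```

In this sketch "retrieval" receives a weight from the "old" statistics, while "hypertext" has none because it never occurs in the "old" collection, which is the problematic case noted above. How such terms should be handled is left open here.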
