Indexing and searching tera-scale Grid-Based Digital Libraries

The University of California, Berkeley and the University of Liverpool, in conjunction with the San Diego Supercomputer Center, are developing Cheshire3, a framework for Grid-based digital library systems and information retrieval services that operates in both single-processor and distributed computing environments. In this paper we discuss some results of testing Grid-based parallel approaches to indexing and retrieval on a variety of information resources. These range from small test collections such as the TREC and INEX collections, through medium-scale metadata collections such as Medline and a test version of MELVYL, the University of California Online Union Catalog (with 15 million and 16.5 million records respectively), up to large-scale collections such as the US National Archives and Records Administration (NARA) Preservation Prototype. We examine our approaches to indexing and retrieving from these collections and the architecture of the system that supports them.
