Metadata harvesting for content-based distributed information retrieval

We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralization of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative's (OAI) Protocol for Metadata Harvesting, the approach occupies the middle ground between content crawling and distributed retrieval. As in crawling, some data move toward the retrieval process, but they are statistics about the content rather than the content itself; this makes more efficient use of network resources and widens the scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision while promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralized retrieval without giving up cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies to deploy it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol which supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multi-model content-based retrieval of distributed file collections.
