Harvesting: Broadening the Field of Distributed Information Retrieval

This chapter argues that, in addition to federated search and gathering (as by Web crawlers), harvesting is an important approach to meeting the needs of distributed IR. We highlight the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and illustrate its use in three projects: OAD, NDLTD, and CITIDEL. We explain how traditional services can be extended in a user-centered fashion, providing details of our new ESSEX search engine, multischeming browsing, and quality-oriented filtering (using rules and SVMs). We conclude with an overview of work in progress on logging and component architectures, and a summary of our findings.
