Search engine coverage of the OAI-PMH corpus

Having indexed much of the "surface" Web, search engines are now using various approaches to index the "deep" Web. At the same time, institutional repositories and digital libraries are adopting the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to expose their holdings. The authors harvested nearly 10 million records from OAI-PMH repositories, extracted 3.3 million unique resource URLs from those records, and then ran searches on samples drawn from this URL collection to determine how much of the OAI-PMH corpus the three major search engines have indexed.
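The extraction and sampling steps described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the authors' actual pipeline: it assumes records arrive as an OAI-PMH ListRecords response in the oai_dc (Dublin Core) format, treats any dc:identifier beginning with http:// or https:// as a resource URL, and uses made-up example.org records. The function names extract_resource_urls and sample_urls are hypothetical.

```python
import random
import xml.etree.ElementTree as ET

# Fully qualified tag name for Dublin Core identifiers.
DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

# Made-up ListRecords response (assumption: oai_dc metadata format).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:identifier>http://repo.example.org/item/1</dc:identifier>
        <dc:identifier>ISBN 0-000-00000-0</dc:identifier>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:identifier>http://repo.example.org/item/2</dc:identifier>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:identifier>http://repo.example.org/item/1</dc:identifier>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>
"""


def extract_resource_urls(xml_text):
    """Return the set of unique HTTP(S) URLs found in dc:identifier elements."""
    root = ET.fromstring(xml_text)
    urls = set()
    for element in root.iter(DC_IDENTIFIER):
        text = (element.text or "").strip()
        # Assumption: only identifiers that look like web URLs count as resources.
        if text.lower().startswith(("http://", "https://")):
            urls.add(text)
    return urls


def sample_urls(urls, k, seed=0):
    """Draw up to k URLs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    pool = sorted(urls)  # sort first so the sample does not depend on set order
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    unique_urls = extract_resource_urls(SAMPLE_RESPONSE)
    print(len(unique_urls), "unique resource URLs")
    print(sample_urls(unique_urls, 1))
```

Each sampled URL would then be submitted to a search engine to check whether it is indexed; that step is omitted here because it depends on the query interface of each engine.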
