Paper information - Harvesting needed to maintain scientific literature online


Millions of scientific articles are freely accessible on the web. While some of them are stored in institutional repositories, many are made available on personal pages, which are exposed to the net's transience. We found that nearly 11% of URLs of PDF documents containing references to life science publications were no longer accessible within 5 months of being harvested using a search engine's (SE) API. For most of them (8.4%), no SE cache backup could be found. Although we have yet to estimate the exact rate at which the scientific literature disappears and the duration of its disappearance, the results so far are a clear indicator that web harvesting is needed to preserve the online scientific literature.

Peter Stoehr | Nikolay Nikolov

[1] Michael L. Nelson, et al. Search engines and their public interfaces: which APIs are the most synchronized? WWW '07, 2007.