Volume 5 Issue 1
Spring 2009
ISSN 1937-7266

Harvesting Needed To Maintain Scientific Literature Online

Nikolay Nikolov

Dept. Of Applied Mathematics and Theoretical Physics
University of Cambridge
Wilberforce Road, Cambridge CB3 0WA UK

Peter Stoehr

EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus - Hinxton, Cambridge, UK


Millions of scientific articles are freely accessible on the web. While some of them are stored in institutional repositories, many are made available on personal pages and are thus exposed to the net's transience. We found that nearly 11% of the URLs of PDF documents containing references to life science publications were no longer accessible within 5 months of being harvested via a search engine's (SE) API. For most of them (8.4% of the sample) no SE cache backup could be found. Although we have yet to estimate the exact rate at which the scientific literature disappears and the duration of its disappearance, the results so far clearly indicate that web harvesting is needed to preserve the online scientific literature.

Categories and Subject Descriptors

H.3.7 [Information Storage and Retrieval]: Digital Libraries

General Terms

Management, Design, Reliability.



Keywords

Preservation, Availability, Scientific Literature.


We undertook a large-scale harvesting of URLs of life science literature that is freely available on the web. As harvesting seeds we used the titles of over 4 million life science journal articles published since 2000 for which metadata records existed in PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) or Agricola (http://agricola.nal.usda.gov). These were submitted as phrasal queries using an SE's API (http://developer.yahoo.com). The queries were limited to documents in PDF format, the most widely used format for scientific publications. In addition, only deduplicated results were requested (i.e. other URLs referring to copies of the same document were not returned). The retrieved URLs were stored. The harvesting was completed within 3 months and yielded over 4.3 million URLs; the vast majority of these were found to be false positives resulting from title queries with low selectivity. To filter out the false positives, only URLs that were hit by multiple queries (i.e. contained the strings of multiple article titles, mostly as part of a bibliography) were considered.
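The multi-hit filtering step above can be sketched as follows. This is a minimal illustration, not the authors' code: the live SE API calls are replaced by a precomputed, hypothetical mapping from title queries to the deduplicated PDF URLs each query returned, and the function and example URLs are assumptions made for the sketch.

```python
# Sketch of the multi-hit false-positive filter: keep only URLs that
# were returned by several distinct title queries, since such documents
# likely contain a bibliography citing multiple articles.
from collections import Counter

def filter_multi_hit_urls(query_results, min_hits=2):
    """query_results maps a title query to the list of URLs it returned.
    Returns the set of URLs hit by at least `min_hits` distinct queries."""
    hit_counts = Counter()
    for urls in query_results.values():
        for url in set(urls):  # count each URL at most once per query
            hit_counts[url] += 1
    return {url for url, n in hit_counts.items() if n >= min_hits}

# Hypothetical example: three title queries; one URL is hit by two of them.
results = {
    "Title A": ["http://example.org/paper1.pdf", "http://example.org/biblio.pdf"],
    "Title B": ["http://example.org/biblio.pdf"],
    "Title C": ["http://example.org/unrelated.pdf"],
}
print(filter_multi_hit_urls(results))
# Only http://example.org/biblio.pdf survives the filter.
```

Requiring multiple hits trades recall for precision: single-copy publications hit by only one title query are discarded along with the false positives, which is consistent with the sharp drop from 4.3 million to 780,000 URLs reported below.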

This yielded a much smaller set of over 780,000 URLs. For a random sample of 250 URLs a download was attempted 5 months after the start of the harvesting. 27 URLs (10.8%) were found to be inaccessible. Of the rest (89.2%), 68.4% were scientific publications and 20.8% were scientific non-publications (e.g. recommended readings, CVs). A cached version was retrievable through the SE's API for only 6 of the inaccessible URLs. Thus the documents referred to by 21 (8.4%) of the sampled URLs appear to have become inaccessible. Judging from the distribution of the PDF documents referred to by the accessible URLs, most of these documents must have been scientific publications.
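The availability check on the sample can be sketched as below. The fetch function is injected rather than hard-coded (in practice it might wrap urllib.request.urlopen with a timeout), so the tallying logic runs offline; the example URLs and the stub fetcher are assumptions for illustration only.

```python
# Sketch of the sample availability check: attempt a download of each
# sampled URL and split the sample into accessible and inaccessible sets.
def check_sample(urls, fetch):
    """fetch(url) should return True if the document could be downloaded.
    Any exception raised by fetch is treated as an inaccessible URL."""
    accessible, inaccessible = [], []
    for url in urls:
        try:
            ok = fetch(url)
        except Exception:
            ok = False
        (accessible if ok else inaccessible).append(url)
    return accessible, inaccessible

# Offline example with a stub fetcher that marks one URL as dead.
sample = ["http://a.example/p.pdf", "http://b.example/q.pdf", "http://c.example/r.pdf"]
dead = {"http://b.example/q.pdf"}
ok, gone = check_sample(sample, lambda u: u not in dead)
print(len(gone) / len(sample))  # fraction of the sample that is inaccessible
```

For each inaccessible URL the study additionally queried the SE cache; only URLs with neither a live copy nor a cached copy were counted as disappeared (21 of 250, i.e. 8.4%).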


We cannot judge from these preliminary results the true number of scientific publications that disappear from the web, as some of them might exist on several sites that were ignored by the deduplication function of the API. In addition, it is hard to estimate the rate of URL deaths, as SE API indexes are known to be somewhat stale [1]. We intend to undertake regular non-deduplicated harvesting to give a more precise answer to these questions. We believe, nevertheless, that the results indicate that URL death is a real problem and that it might affect a significant part of the scientific literature available online.

We see harvesting as a way to address this problem. While some efforts have been made by the Internet Archive (www.archive.org), they appear inadequate: the crawl frequency is too low, and the harvesting focuses on HTML documents rather than PDFs. Academic efforts like CiteSeer (http://citeseer.ist.psu.edu) and Rexa (http://rexa.info) cover only publications from certain disciplines (mostly computer science). On the other hand, as the results show, SEs do not cache all documents, and even when they do, the cache is not the original PDF but an HTML rendering that lacks some content (e.g. images, tables). In addition, commercial SEs may change their terms of use at any moment. Google did so in 2005 by effectively limiting third parties to creating a GUI interface to its indexes (the company states that existing users of its older, more generous API will not be affected). And although Google indexes scientific publications as part of Google Scholar (http://scholar.google.com), it limits access to them [2]. This makes it clear that a more rigorous non-profit harvesting effort is needed to preserve the online scientific literature.


Fig. 1 Dataflow of the study and distribution of the harvested URLs



[1] Frank McCown and Michael L. Nelson, "Search engines and their public interfaces: which APIs are the most synchronized?", Proceedings of the 16th International Conference on World Wide Web, May 8-12, 2007, Banff, Alberta, Canada. doi:10.1145/1242572.1242763

[2] "Google blocks academic", http://tinyurl.com/2j7alg, last retrieved on February 3, 2008.