The Cost of a WARC: Analyzing Web Archives in the Cloud

The value of web archives to support scholarship in the humanities and social sciences is slowly being realized by the increasing availability of scalable tools and platforms. The cost of providing scholarly access is a critical component of developing a long-term sustainability strategy. This paper attempts to answer a straightforward question: How much does it cost to analyze web archives in the cloud? To make this question more concrete, we examine the creation of three derivatives (extraction of collection statistics, full text, and the webgraph) that serve as the starting points of many scholarly inquiries. Our analysis shows that these typical derivatives cost around US$7 per TB using our Archives Unleashed Toolkit. We describe in detail the methodology and assumptions made to arrive at this figure. To our knowledge, we are the first to quantify the economics of scholarly access to web archives, and we believe that this information is valuable for service planning by archives, libraries, and other institutions.
