The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building, all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to use our Archives Unleashed Toolkit to generate "derivative products" that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, which gives them access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.
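
To make the four-activity process model concrete, the following is a minimal sketch in plain Python. It does not use the Archives Unleashed Toolkit or its API; the record layout (`url`, `mime`, `links`), the sample data, and all function names are hypothetical and chosen only for illustration. It filters a toy collection to HTML pages, extracts domain-to-domain links, aggregates them into counts, and renders a crude text chart.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for pages in a web archive collection.
# A real collection would be read from WARC files; this layout is an assumption.
RECORDS = [
    {"url": "http://example.org/a", "mime": "text/html",
     "links": ["http://example.com/x", "http://example.net/y"]},
    {"url": "http://example.org/b", "mime": "text/html",
     "links": ["http://example.com/z"]},
    {"url": "http://example.org/logo.png", "mime": "image/png", "links": []},
]


def filter_records(records, mime="text/html"):
    """Filter: keep only records with the requested MIME type."""
    return [r for r in records if r["mime"] == mime]


def extract_links(records):
    """Extract: yield (source domain, target domain) pairs for each link."""
    for r in records:
        src = urlparse(r["url"]).netloc
        for link in r["links"]:
            yield src, urlparse(link).netloc


def aggregate(pairs):
    """Aggregate: count how often each domain pair occurs."""
    return Counter(pairs)


def visualize(counts):
    """Visualize: render the counts as a simple text bar chart."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(
        f"{src} -> {dst} {'#' * n} ({n})" for (src, dst), n in ordered
    )


if __name__ == "__main__":
    counts = aggregate(extract_links(filter_records(RECORDS)))
    print(visualize(counts))
```

Because each stage only consumes the output of the previous one, the aggregated counts could equally be saved to a file and visualized later in a different tool, which is the sense in which the activities can be disaggregated across time, space, and tools.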
