Digital humanities and web archives: Possible new paths for combining datasets

This article argues that web archives should make their collections available as data, and not only as sources viewed through the Wayback Machine’s interface, which displays one web page at a time. Doing so will help unlock the full potential of the treasure trove that web archives constitute and open them up to methods from the wider field of digital humanities. Based on a case study of the entire Danish web domain, .dk, the article discusses the methodological challenges involved in combining large datasets extracted from web archives, namely metadata about the size of websites and data about hyperlinks from the same websites. The aim is to answer two questions: 1) How can two different types of datasets extracted from a web archive, in this case the Danish Netarkivet, be combined? 2) What can the result of such a combination teach us about the structural characteristics of the Danish web domain from 2006 to 2015? The article shows that it is possible to go beyond the Wayback Machine as the prime interface to web archives by combining two distinct datasets, and that doing so can provide valuable knowledge about the overall structure of the Danish web domain. In particular, websites of the same size tend to constitute isolated ‘link islands’, and big websites are also the most important in the hyperlink network, a pattern that is more pronounced in 2015 than in 2006.
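As a rough illustration of what combining the two kinds of data can look like, the sketch below joins a table of website sizes with a list of hyperlinks and counts links between size classes. It is not the article's actual method: the domain names, the byte counts, the one-megabyte threshold, and the helper names `size_class` and `combine` are all hypothetical, and in-degree is used only as one simple stand-in for importance in the link network.

```python
from collections import Counter

# Hypothetical toy data (not from Netarkivet): archived bytes per website,
# and hyperlinks as (source, target) pairs.
site_sizes = {"a.dk": 5_000_000, "b.dk": 4_000_000, "c.dk": 20_000, "d.dk": 10_000}
links = [
    ("a.dk", "b.dk"),
    ("b.dk", "a.dk"),
    ("c.dk", "d.dk"),
    ("d.dk", "a.dk"),
    ("x.dk", "a.dk"),  # source missing from the size data
]


def size_class(n_bytes, threshold=1_000_000):
    """Assign a coarse size class; the threshold is an arbitrary assumption."""
    return "big" if n_bytes >= threshold else "small"


def combine(site_sizes, links):
    """Join each link with the size class of both endpoints.

    Links whose source or target has no size record are skipped, which is
    one possible way to handle mismatches between the two datasets.
    """
    class_pairs = Counter()  # (source class, target class) -> number of links
    in_degree = Counter()    # target website -> number of incoming links
    for source, target in links:
        if source not in site_sizes or target not in site_sizes:
            continue
        pair = (size_class(site_sizes[source]), size_class(site_sizes[target]))
        class_pairs[pair] += 1
        in_degree[target] += 1
    return class_pairs, in_degree


class_pairs, in_degree = combine(site_sizes, links)
print(class_pairs)
print(in_degree)
```

In this toy example most links stay within a size class, which is the kind of pattern the ‘link islands’ finding refers to. A real analysis would also have to decide how to define a website, how to handle links that leave the .dk domain, and how to compare different years.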
