Community, tools, and practices in web archiving: The state‐of‐the‐art in relation to social science and humanities research needs

The web encourages the constant creation and distribution of large amounts of information; it is also a valuable resource for understanding human behavior and communication. To take full advantage of the web as a research resource that extends beyond snapshots of the present, however, researchers must treat web archiving as an essential element of any research program involving web resources. The ephemeral character of the web requires that researchers take proactive steps in the present to enable future analysis. Efforts to archive the web, or portions of it, have been developed around the world, but they have not yet provided reliable and scalable solutions. This article summarizes the current state of web archiving in relation to researchers and research needs. Interviews with researchers, archivists, and technologists identify the differences in purpose, scope, and scale of current web archiving practice, and the professional tensions that arise from these differences. The findings outline the challenges that still face researchers who wish to engage seriously with web content as an object of research, and archivists who must strike a balance among a range of user needs.
