The evolution of web archiving

Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2014, which provide a comprehensive characterization of web archiving initiatives and their evolution. We identified several patterns and trends that highlight challenges and opportunities, and we discuss how they can help define strategies, estimate resources and provide guidelines for research and development of better technology. Our results show that in recent years there was significant growth in the number of initiatives and of countries hosting them, in the volume of data and in the number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort to preserving digital information, other results presented throughout the paper raise concerns, such as the small amount of archived data in comparison with the amount of data that is being published online.
