Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives

Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.
