Infrastructure for supporting exploration and discovery in web archives

Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe two applications built on the platform: a service that provides temporal browsing, and an interactive visualization based on topic models that allows users to explore archived content.
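To make the idea of temporal browsing over a versioned store concrete, the following is a minimal sketch and not Warcbase's actual schema or API. It assumes that each capture is keyed by URL and by a sortable capture timestamp (here a 14-digit string such as "20130601000000"), and it uses an in-memory Python class as a stand-in for a distributed table. The names CaptureStore, put, and get_as_of are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class CaptureStore:
    """Hypothetical in-memory stand-in for a table of timestamped captures per URL."""

    def __init__(self):
        # url -> parallel lists of capture timestamps and contents, sorted by timestamp
        self._timestamps = defaultdict(list)
        self._contents = defaultdict(list)

    def put(self, url, timestamp, content):
        """Store one capture, keeping each URL's captures sorted by timestamp."""
        stamps = self._timestamps[url]
        i = bisect_right(stamps, timestamp)
        stamps.insert(i, timestamp)
        self._contents[url].insert(i, content)

    def get_as_of(self, url, timestamp):
        """Return (capture_time, content) for the latest capture at or before
        the given timestamp, or None if no such capture exists."""
        stamps = self._timestamps.get(url, [])
        i = bisect_right(stamps, timestamp)
        if i == 0:
            return None
        return stamps[i - 1], self._contents[url][i - 1]
```

Under these assumptions, a temporal browsing request for a URL "as of" some date reduces to one lookup: after storing captures of the same URL from 2012 and 2013, calling get_as_of with a late-2012 timestamp returns the 2012 capture. A store with native cell versioning could answer the same query without the explicit sorted lists used here.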
