Paper information - Infrastructure for supporting exploration and discovery in web archives

Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe a service that provides temporal browsing and an interactive visualization based on topic models that allows users to explore archived content.

Jimmy J. Lin | Jinfeng Rao | Milad Gholami
