Web archive profiling through CDX summarization

As public web archives proliferate, profiling their contents becomes more important, both to understand their immense holdings and to support request routing in the Memento aggregator. To save time, the aggregator should poll only the archives that are likely to hold a copy of the requested URI. Using the crawler index (CDX) files produced after crawling, we can generate profiles that summarize each archive's holdings and inform the routing of the aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we correctly identified about 78% of the URIs as present or not present in the archive at less than 1% relative cost compared to the complete-knowledge profile, and 94% of the URIs at less than 10% relative cost, without any false negatives. Relative to the TLD-only profile, the registered-domain profile doubled the routing precision, while the complete hostname plus one path segment gave a tenfold increase in routing precision.
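To make the range of profiling strategies concrete, here is a minimal sketch of how URIs might be reduced to keys of varying granularity and then used for routing. It is an illustration under stated assumptions, not the paper's implementation: the function names (`uri_key`, `build_profile`, `should_poll`), the key format (reversed host labels followed by `)/` and leading path segments), and the sample URIs are all hypothetical. Truncating the host by label count is a simplification; identifying the registered domain properly would require a lookup against the Public Suffix List.

```python
from urllib.parse import urlsplit


def uri_key(uri, host_segments=None, path_segments=0):
    """Reduce a URI to a coarse key (hypothetical format).

    host_segments: how many host labels to keep, counted from the TLD
                   (None keeps the complete hostname).
    path_segments: how many leading path segments to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    labels = (parts.hostname or "").split(".")
    labels.reverse()  # "www.example.com" -> ["com", "example", "www"]
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(uris, **policy):
    """Summarize an archive's holdings as the set of keys under one policy."""
    return {uri_key(u, **policy) for u in uris}


def should_poll(profile, uri, **policy):
    """Route a lookup to the archive only if the URI's key is in its profile."""
    return uri_key(uri, **policy) in profile


# Illustrative holdings; not data from the paper.
holdings = [
    "http://www.example.com/news/2015/a.html",
    "http://blog.example.org/",
]

tld_profile = build_profile(holdings, host_segments=1)
host_path_profile = build_profile(holdings, path_segments=1)
```

Because a lookup is reduced with the same policy that built the profile, any URI the archive actually holds always matches, so a sketch like this produces no false negatives. Coarser keys shrink the profile but admit more false positives: with the TLD-only profile above, a lookup for `http://other.com/x` would still be routed to the archive, whereas the hostname plus one path segment profile would reject it.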
