Web Archiving: Organizing Web Objects into Web Containers to Optimize Access

The web is becoming the preferred medium for communicating and storing information pertaining to almost any human activity. However, it is an ephemeral medium whose contents change constantly, so part of our cultural and scientific heritage is permanently lost on a regular basis. Archiving important web contents is a very challenging technical problem because of the web's tremendous scale, complex structure, extremely dynamic nature, and rich, heterogeneous, and deep contents. In this paper, we consider the problem of archiving a linked set of web objects into web containers in such a way as to minimize the number of containers accessed during a typical browsing session. We develop a method that uses PageRank and optimized graph partitioning to enable faster browsing of archived web contents. We include simulation results that illustrate the performance of our scheme and compare it with the common scheme currently used to organize web objects into web containers.
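As a rough illustration of the idea, and not the authors' actual algorithm, the Python sketch below weights each hyperlink by the PageRank of its source page, groups pages into fixed-capacity containers so that heavily used links stay inside one container, and compares the resulting cross-container link weight with a baseline that fills containers in arrival order. The greedy edge-merging heuristic is a simple stand-in for the optimized graph partitioning named in the abstract. The arrival-order baseline, the toy link graph, the capacity of three objects per container, and all function names are assumptions made for this example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each page to its out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = graph[u]
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank


def link_weights(graph, rank):
    """Undirected weight per page pair: rank of source divided by its out-degree."""
    weights = {}
    for u, outs in graph.items():
        for v in outs:
            if u == v:
                continue
            key = tuple(sorted((u, v)))
            weights[key] = weights.get(key, 0.0) + rank[u] / len(outs)
    return weights


def greedy_containers(graph, weights, capacity):
    """Merge the heaviest links first while the merged group fits in one container."""
    cluster = {u: {u} for u in graph}  # page -> set shared by all group members
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _ in ordered:
        cu, cv = cluster[u], cluster[v]
        if cu is not cv and len(cu) + len(cv) <= capacity:
            cu |= cv
            for w in cv:
                cluster[w] = cu
    ids = {}
    return {u: ids.setdefault(id(cluster[u]), len(ids)) for u in graph}


def sequential_containers(graph, capacity):
    """Assumed baseline: fill containers in the order pages were listed."""
    return {u: i // capacity for i, u in enumerate(graph)}


def cut_weight(weights, assignment):
    """Total weight of links whose endpoints sit in different containers."""
    return sum(w for (u, v), w in weights.items()
               if assignment[u] != assignment[v])


# Toy link graph (hypothetical): two tightly linked groups joined by one link,
# listed in an interleaved order such as a crawler might produce.
graph = {
    "a1": ["a2", "a3"],
    "b1": ["b2", "b3"],
    "a2": ["a1", "a3"],
    "b2": ["b1", "b3"],
    "a3": ["a1", "a2", "b1"],
    "b3": ["b1", "b2"],
}
rank = pagerank(graph)
weights = link_weights(graph, rank)
greedy = greedy_containers(graph, weights, capacity=3)
baseline = sequential_containers(graph, capacity=3)
print("greedy cut:", cut_weight(weights, greedy))
print("baseline cut:", cut_weight(weights, baseline))
```

The cross-container link weight is only a proxy for the number of containers opened during a browsing session; a fuller evaluation would simulate sessions over the archived graph, as the simulation results mentioned above presumably do.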
