Adaptive on-line page importance computation

The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance, or page rank, is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require substantial extra CPU and disk resources (e.g. to store, maintain and read the link matrix). We introduce a new algorithm, OPIC, that works on-line and uses far fewer resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. Thus it can be used to focus crawling on the most interesting pages. We prove the correctness of OPIC. We present Adaptive OPIC, which also works on-line but adapts dynamically to changes in the web. A variant of this algorithm is now used by Xyleme. We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web.
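
The abstract does not spell out how OPIC works, so the following is only a minimal sketch of the idea, assuming the cash/history formulation from the full paper: each page holds some "cash", and visiting a page records that cash in the page's history and splits it equally among the pages it links to. Only the out-links of the page being visited are needed, so no link matrix is stored. The names `opic`, `out_links`, and `steps`, the round-robin visiting order, and the toy graph are illustrative assumptions, not the authors' implementation. The sketch also assumes a strongly connected graph in which every page has at least one out-link.

```python
def opic(out_links, steps):
    """Estimate page importance on-line (illustrative sketch).

    out_links: dict mapping each page to the list of pages it links to.
               Assumed: every page has at least one out-link and the
               graph is strongly connected.
    steps:     number of page visits to perform.
    """
    pages = list(out_links)
    # Every page starts with an equal share of one unit of cash.
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}
    total_history = 0.0

    for step in range(steps):
        # Assumed scheduling strategy: plain round-robin over the pages.
        page = pages[step % len(pages)]

        # Record the page's current cash in its history, then empty it.
        amount = cash[page]
        cash[page] = 0.0
        history[page] += amount
        total_history += amount

        # Split the cash equally among the pages this page links to.
        share = amount / len(out_links[page])
        for target in out_links[page]:
            cash[target] += share

    # Total cash is always 1, so these estimates sum to 1.
    return {p: (history[p] + cash[p]) / (total_history + 1.0) for p in pages}


if __name__ == "__main__":
    # Hypothetical three-page graph used only for illustration.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(opic(graph, 3000).items()):
        print(page, round(score, 3))
```

Because the estimate is available after every visit, a crawler could in principle use it to decide which page to fetch next; the round-robin order here is just the simplest choice of scheduling strategy.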
