A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

Several iterative hyperlink-based similarity measures have been published to express the similarity of web pages. However, evaluating complex similarity functions over large repositories containing hundreds of millions of pages usually seems hopeless. We introduce scalable algorithms for computing SimRank scores, which express the contextual similarity of pages based on the hyperlink structure. The proposed methods scale well to large repositories and fulfill strict requirements on computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.
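To make the Monte Carlo idea concrete, the following is a minimal sketch rather than the method of this paper. It relies on the standard random-walk characterization of SimRank: the score of two pages equals the expected value of c^τ, where τ is the first step at which two reverse random walks started from those pages meet. The function name `simrank_mc`, the decay factor `c = 0.6`, the walk count, and the step cutoff are illustrative assumptions, and the sketch uses plain independent walks with no indexing or parallelization.

```python
import random


def simrank_mc(in_links, u, v, c=0.6, num_walks=2000, max_steps=10, seed=0):
    """Estimate SimRank(u, v) by simulating pairs of reverse random walks.

    in_links maps each page to the list of pages linking to it.
    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    if u == v:
        return 1.0  # a page is maximally similar to itself by definition

    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            preds_a = in_links.get(a, [])
            preds_b = in_links.get(b, [])
            if not preds_a or not preds_b:
                break  # a walk is stuck, so this pair never meets
            # Each walk steps backwards along a uniformly chosen in-link.
            a = rng.choice(preds_a)
            b = rng.choice(preds_b)
            if a == b:
                total += c ** step  # first meeting at this step
                break
    # Walks that do not meet within max_steps contribute zero (truncation).
    return total / num_walks


if __name__ == "__main__":
    # Toy graph: pages "a" and "b" are both linked only from "w".
    graph = {"a": ["w"], "b": ["w"]}
    print(simrank_mc(graph, "a", "b"))
```

On the toy graph both walks are forced to meet at `w` after one step, so the estimate coincides with the decay factor. On real graphs the estimate carries sampling error that shrinks as `num_walks` grows, plus a truncation error bounded by `c ** max_steps`.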
