Block-Based Similarity Search on the Web Using Manifold-Ranking

Similarity search on the web aims to find web pages similar to a query page and to return them as a ranked list. The popular approach is to calculate the pairwise similarity between web pages using the Cosine measure and then rank the pages by their similarity to the query page. In this paper, we propose a novel similarity search approach that re-ranks the initially retrieved web pages by manifold-ranking of page blocks. First, web pages are segmented into semantic blocks with the VIPS algorithm. Second, the blocks are assigned ranking scores by the manifold-ranking algorithm. Finally, web pages are re-ranked according to overall retrieval scores obtained by fusing the ranking scores of their blocks. The proposed approach evaluates web page similarity at the finer granularity of the page block rather than at the traditional, coarser granularity of the whole web page. Moreover, it makes use of the intrinsic global manifold structure of the blocks to rank them more appropriately. Experimental results on the ODP data demonstrate that the proposed approach significantly outperforms the popular Cosine measure. The results also show that the semantic block is a better unit than the whole web page in the manifold-ranking process.
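
The second and third steps can be illustrated with a minimal sketch. It assumes blocks are already segmented (VIPS itself is not reproduced here) and represented as small term-weight vectors. The affinity matrix built from cosine similarity, the value of `alpha`, the iteration count, and the sum-based fusion rule are illustrative assumptions rather than the settings used in the paper; the function names `manifold_rank` and `fuse_by_page` are hypothetical. The update rule is the standard manifold-ranking iteration f ← αSf + (1 − α)y with S = D^(-1/2) W D^(-1/2).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def manifold_rank(vectors, query_idx, alpha=0.5, iters=100):
    """Score every block by iterating f <- alpha * S f + (1 - alpha) * y."""
    n = len(vectors)
    # Affinity matrix W (assumed: cosine similarity, zero diagonal).
    W = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j])
          if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    # y marks the query blocks with 1 and all other blocks with 0.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

def fuse_by_page(scores, block_to_page, query_idx):
    """Assumed fusion rule: a page's score is the sum of its block scores."""
    totals = {}
    for i, page in enumerate(block_to_page):
        if i in query_idx:
            continue  # the query page itself is not ranked
        totals[page] = totals.get(page, 0.0) + scores[i]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: block 0 belongs to the query page, blocks 1-2 to page A, 3-4 to page B.
vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
block_to_page = ["Q", "A", "A", "B", "B"]
query_idx = {0}

scores = manifold_rank(vectors, query_idx)
ranking = fuse_by_page(scores, block_to_page, query_idx)
print([page for page, _ in ranking])
```

In this toy example the blocks of page B share no terms with the query block, so a plain Cosine comparison against the query gives them zero similarity. Manifold-ranking still assigns them a small positive score, because the score spreads from the query through the intermediate block of page A that shares a term with page B.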