A unified probabilistic framework for Web page scoring systems

The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad-topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can afford to visit only the first few positions of the returned list, which do not necessarily refer to the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. We propose a general probabilistic framework for Web page scoring systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical ranking also takes the page contents into account and is the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.
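
To make the horizontal/vertical distinction concrete, the following is a minimal sketch of a PageRank-style random-surfer score computed by power iteration. It is not the WPSS framework itself, whose equations are not given in this abstract. The function name `score_pages`, the `relevance` dictionary, and the damping value 0.85 are assumptions introduced for illustration only. With no relevance scores the surfer jumps uniformly, so the result depends only on the link graph (horizontal). With relevance scores the jumps are biased toward on-topic pages (vertical).

```python
def score_pages(links, relevance=None, damping=0.85, iterations=100):
    """Random-surfer scores by power iteration (illustrative sketch).

    links:     dict mapping a page to the list of pages it links to.
    relevance: optional dict mapping a page to a non-negative content
               relevance (hypothetical input). None means uniform jumps.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)

    # Jump distribution: uniform (horizontal) or content-biased (vertical).
    if relevance is None:
        jump = {p: 1.0 / n for p in pages}
    else:
        total = sum(relevance.get(p, 0.0) for p in pages)
        if total <= 0:
            raise ValueError("relevance must be positive for at least one page")
        jump = {p: relevance.get(p, 0.0) / total for p in pages}

    scores = dict(jump)
    for _ in range(iterations):
        # With probability 1 - damping the surfer jumps to a random page.
        new = {p: (1.0 - damping) * jump[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * scores[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # A page without outgoing links hands its mass to the jump distribution.
                for q in pages:
                    new[q] += damping * scores[p] * jump[q]
        scores = new
    return scores
```

Calling `score_pages(links)` gives a topology-only ranking, while `score_pages(links, relevance={...})` shifts score toward pages judged relevant to a topic and toward the pages they link to. Both variants return scores that sum to 1.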
