Ranking billions of web pages using diodes

Introduction

Because of the Web's rapid growth and lack of central organization, Internet search engines play a vital role in helping users retrieve relevant information from the tens of billions of documents available. With millions of dollars of potential revenue at stake, commercial Web sites compete fiercely for prominent placement on the first page of results returned by a search engine. As a result, search engine optimizers (SEOs) have developed various search engine spamming (or spamdexing) techniques to artificially inflate the rankings of Web pages.

Link-based ranking algorithms, such as Google's PageRank, have been largely effective against most conventional spamming techniques. However, PageRank has three fundamental flaws that, when exploited aggressively, prove to be its Achilles' heel: first, it gives a minimum guaranteed score to every page on the Web; second, it rewards all incoming links as valid endorsements; and third, it imposes no penalty for linking to low-quality pages.

SEOs can exploit these shortcomings to the extreme by deploying an Artificial Web: an extremely large collection of computer-generated Web pages containing many links to only a few target pages. Each page of the Artificial Web collects the minimum PageRank and feeds it back to the target pages. Although each individual endorsement is small, the flaws of PageRank allow an Artificial Web to accumulate sizable PageRank values for the target pages. SEOs can even download a substantial portion of the real Web and modify only the destinations of its hyperlinks, thus circumventing any detection algorithm based on the quality or the size of pages. Because an Artificial Web can be comparable in size to the real Web, SEOs can seriously compromise the objectivity of the results that PageRank provides. Although statistical measures can be used to identify attributes specific to an Artificial Web and filter it out of search results, it is far more desirable to develop a ranking model that is free of such exploits to begin with.
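To make the three flaws concrete, here is a minimal sketch (not taken from the paper) of power-iteration PageRank on a toy graph. The graph layout, node names, farm size, and damping factor are all illustrative assumptions. It shows how a link farm of generated pages with no inlinks of their own, each collecting only the guaranteed minimum score, can still funnel a sizable rank to a single target page.

```python
# A minimal sketch, assuming the standard uniform-teleport PageRank model.
# All node names, the farm size, and parameters below are illustrative.

def pagerank(graph, damping=0.85, iters=100):
    """Compute PageRank by power iteration.

    graph: dict mapping each node to the list of nodes it links to.
    Dangling nodes (no outlinks) spread their mass uniformly.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Flaw 1: every page starts with the minimum guaranteed score.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    # Flaw 2: every incoming link counts as an endorsement,
                    # and (flaw 3) the linking page pays no penalty for it.
                    new[w] += share
            else:
                for w in nodes:  # dangling node: distribute uniformly
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# A small "real" web of mutually linking pages plus one target page.
web = {
    "a": ["b", "target"], "b": ["c"], "c": ["a"], "target": ["a"],
}

# Attach an artificial web: 50 generated pages, each linking only to "target".
farm = {f"spam{i}": ["target"] for i in range(50)}
web_spammed = {**web, **farm}

print("clean  :", round(pagerank(web)["target"], 4))
print("spammed:", round(pagerank(web_spammed)["target"], 4))
```

On this particular toy graph, the target's score rises from roughly 0.20 to 0.26 even though each farm page ranks at the bare minimum (1 - d)/n, and the inflation grows with the number of generated pages, which is exactly the lever an Artificial Web pulls.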
