Link analysis ranking

The explosive growth and widespread accessibility of the Web have led to a surge of research activity in information retrieval on the World Wide Web. Ranking has always been an important component of any information retrieval system, and in Web search its importance becomes critical: given the size of the Web, it is imperative to have ranking functions that capture the user's needs. To this end, the Web offers a rich context of information expressed through its hyperlinks.

In this thesis we investigate, theoretically and experimentally, the application of Link Analysis to ranking on the Web. Building upon the framework of hubs and authorities [57], we propose new families of Link Analysis Ranking algorithms. Some of the algorithms we define no longer enjoy the linearity property of the previous algorithms. As a result, it is harder to analyze them, or even to prove that they converge. However, for a special case of the families we consider, we are able to prove convergence, and we provide a complete characterization of the combinatorial properties of the stationary authority weights it produces.

The plethora of Link Analysis Ranking algorithms creates the need for a formal way to evaluate their properties and compare their behavior. We introduce a theoretical framework for the study of Link Analysis Ranking algorithms, and we define specific properties of the algorithms within this framework. Using these properties, we provide an axiomatic characterization of the INDEGREE algorithm, which ranks pages according to the number of incoming links.

We conclude the thesis with an extensive experimental evaluation of Link Analysis Ranking. We test the algorithms over multiple queries, and we use user feedback to determine their quality. Our experiments reveal some of the limitations of Link Analysis Ranking. Specifically, it appears that for most algorithms the nodes and structures in the graph that they favor do not correspond to the most relevant pages in the collection. These observations offer new insight into the mechanics of the algorithms, and we believe that they will lead to improved algorithm design and better input graphs for the algorithms.
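As a rough illustration of the two baseline ideas named in the abstract, the sketch below scores a toy link graph with INDEGREE (count of incoming links) and with the basic hubs-and-authorities iteration. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the edge-list representation, the choice to count parallel links once, the L2 normalization, the fixed iteration count, and the example graph are all hypothetical. The new non-linear families proposed in the thesis are not shown.

```python
from math import sqrt


def indegree_scores(edges):
    """Score each node by its number of incoming links.

    Assumption: `edges` is a list of (source, target) pairs and
    duplicate links between the same pair are counted once.
    """
    edge_set = set(edges)
    scores = {node: 0 for edge in edge_set for node in edge}
    for _source, target in edge_set:
        scores[target] += 1
    return scores


def rank(scores):
    """Order nodes by decreasing score; ties are broken by node name."""
    return sorted(scores, key=lambda node: (-scores[node], node))


def _normalize(weights):
    """Scale weights to unit L2 norm (left unchanged if all are zero)."""
    norm = sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {node: w / norm for node, w in weights.items()}


def hits(edges, iterations=50):
    """Basic hubs-and-authorities iteration on an edge list.

    Authority of a node = sum of hub weights of nodes linking to it;
    hub of a node = sum of authority weights of nodes it links to.
    The fixed iteration count is an illustrative assumption.
    """
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        new_auth = dict.fromkeys(nodes, 0.0)
        for source, target in edge_set:
            new_auth[target] += hub[source]
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target in edge_set:
            new_hub[source] += new_auth[target]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: an edge (u, v) means page u links to page v.
EXAMPLE_EDGES = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d"), ("d", "a")]
```

Calling `rank(indegree_scores(EXAMPLE_EDGES))` orders the toy pages by incoming links, and `hits(EXAMPLE_EDGES)` returns the authority and hub weights as two dictionaries. Both updates in `hits` are plain sums over links, which is the linearity the abstract says some of the new algorithms give up.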

[1]  Santosh S. Vempala,et al.  On clusterings: Good, bad and spectral , 2004, JACM.

[2]  Matthew Richardson,et al.  The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank , 2001, NIPS.

[3]  Ben Shneiderman,et al.  Structural analysis of hypertexts: identifying hierarchies and useful metrics , 1992, TOIS.

[4]  Jon M. Kleinberg,et al.  Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text , 1998, Comput. Networks.

[5]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[6]  Ronald Fagin,et al.  Searching the workplace web , 2003, WWW '03.

[7]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[8]  Christoph Braun,et al.  Coherence of gamma-band EEG activity as a basis for associative learning , 1999, Nature.

[9]  Ronald Fagin,et al.  Comparing and aggregating rankings with ties , 2004, PODS '04.

[10]  R. Devaney An Introduction to Chaotic Dynamical Systems , 1990 .

[11]  Krishna Bharat,et al.  When experts agree: using non-affiliated experts to rank popular topics , 2002, ACM Trans. Inf. Syst.

[12]  Yannis E. Ioannidis,et al.  Selectivity Estimation Without the Attribute Value Independence Assumption , 1997, VLDB.

[13]  Michael I. Jordan,et al.  Stable algorithms for link analysis , 2001, SIGIR '01.

[14]  Allan Borodin,et al.  Finding authorities and hubs from link structures on the World Wide Web , 2001, WWW '01.

[15]  R. Graham,et al.  Spearman's Footrule as a Measure of Disarray , 1977 .

[16]  Gabriel Pinski,et al.  Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics , 1976, Inf. Process. Manag.

[17]  Piotr Indyk,et al.  Enhanced hypertext categorization using hyperlinks , 1998, SIGMOD '98.

[18]  Michael I. Jordan,et al.  Link Analysis, Eigenvectors and Stability , 2001, IJCAI.

[19]  Renée J. Miller,et al.  LIMBO: Scalable Clustering of Categorical Data , 2004, EDBT.

[20]  Rajeev Motwani,et al.  The PageRank Citation Ranking: Bringing Order to the Web , 1999, WWW 1999.

[21]  Oded Galor,et al.  Discrete Dynamical Systems , 2005 .

[22]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[23]  Donna K. Harman,et al.  Results and Challenges in Web Search Evaluation , 1999, Comput. Networks.

[24]  John A. Tomlin,et al.  A new paradigm for ranking pages on the world wide web , 2003, WWW '03.

[25]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[26]  Peter Bailey,et al.  Overview of the TREC-8 Web Track , 2000, TREC.

[27]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[28]  Monika Henzinger,et al.  Finding Related Pages in the World Wide Web , 1999, Comput. Networks.

[29]  Jon M. Kleinberg,et al.  Inferring Web communities from link topology , 1998, HYPERTEXT '98.

[30]  Piotr Indyk,et al.  Similarity Search on the Web: Evaluation and Scalability Considerations , 2001 .

[31]  Alberto O. Mendelzon,et al.  What is this page known for? Computing Web page reputations , 2000, Comput. Networks.

[32]  Mark E. Frisse,et al.  Searching for information in a hypertext medical handbook , 1987, Commun. ACM.

[33]  Naftali Tishby,et al.  Agglomerative Information Bottleneck , 1999, NIPS.

[34]  Charles H. Hubbell An Input-Output Approach to Clique Identification , 1965 .

[35]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[36]  Jacques Savoy,et al.  Report on the TREC-8 Experiment: Searching on the Web and in Distributed Collections , 1999, TREC.

[37]  Massimo Marchiori,et al.  The Quest for Correct Information on the Web: Hyper Search Engines , 1997, Comput. Networks.

[38]  E. Garfield Citation analysis as a tool in journal evaluation. , 1972, Science.

[39]  Aya Soffer,et al.  PicASHOW: pictorial authority search by hyperlinks on the Web , 2001, WWW '01.

[40]  Craig Silverstein,et al.  Analysis of a Very Large Altavista Query Log, SRC Technical Note #1998-14 , 1998 .

[41]  Patrick Doreian,et al.  Measuring the relative standing of disciplinary journals , 1988, Inf. Process. Manag.

[42]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[43]  Leo Katz,et al.  A new status index derived from sociometric analysis , 1953 .

[44]  Rick Kazman,et al.  WebQuery: Searching and Visualizing the Web Through Connectivity , 1997, Comput. Networks.

[45]  Santosh S. Vempala,et al.  Latent semantic indexing: a probabilistic analysis , 1998, PODS '98.

[46]  Jacques Savoy,et al.  Report on the TREC-9 Experiment: Link-based Retrieval and Distributed Collections , 2000, TREC.

[47]  Ronald Fagin,et al.  Comparing top k lists , 2003, SODA '03.

[48]  W. Greub Linear Algebra , 1981 .

[49]  Amit Singhal,et al.  A case study in web search using TREC algorithms , 2001, WWW '01.

[50]  Alan M. Frieze,et al.  Clustering in large graphs and matrices , 1999, SODA '99.

[51]  Wei Zhang,et al.  Improvement of HITS-based algorithms on web documents , 2002, WWW '02.

[52]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev.

[53]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[54]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci.

[55]  Shlomo Moran,et al.  Rank stability and rank similarity of web link-based ranking algorithms , 2001 .

[56]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[57]  Krishna Bharat,et al.  Improved algorithms for topic distillation in a hyperlinked environment , 1998, SIGIR '98.

[58]  Feng Shao,et al.  XRANK: ranked keyword search over XML documents , 2003, SIGMOD '03.

[59]  Richard A. Holmgren A First Course in Discrete Dynamical Systems , 1994 .

[60]  Thomas Hofmann Learning Probabilistic Models of the Web , 2000, SIGIR 2000.

[61]  Anna R. Karlin,et al.  Spectral analysis of data , 2001, STOC '01.

[62]  Gareth O. Roberts,et al.  Markov‐chain Monte Carlo: Some practical implications of theoretical results , 1998 .

[63]  Nancy L. Geller,et al.  On the citation influence methodology of Pinski and Narin , 1978, Inf. Process. Manag.

[64]  Jennifer Widom,et al.  Scaling personalized web search , 2003, WWW '03.

[65]  Sylvia Richardson,et al.  Markov Chain Monte Carlo in Practice , 1997 .

[66]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[67]  Patrick Doreian,et al.  A Measure of Standing for Citation Networks Within a Wider Environment , 1994, Inf. Process. Manag.

[68]  Brian D. Davison Recognizing Nepotistic Links on the Web , 2000 .

[69]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[70]  S. Sudarshan,et al.  Keyword searching and browsing in databases using BANKS , 2002, Proceedings 18th International Conference on Data Engineering.

[71]  Shlomo Moran,et al.  The stochastic approach for link-structure analysis (SALSA) and the TKC effect , 2000, Comput. Networks.

[72]  M. Kendall Rank Correlation Methods , 1949 .

[73]  Gareth O. Roberts,et al.  Downweighting tightly knit communities in world wide web ranking. , 2003 .

[74]  Alberto O. Mendelzon,et al.  What do the Neighbours Think? Computing Web Page Reputations , 2000, IEEE Data Eng. Bull.

[75]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.

[76]  John A. Tomlin,et al.  An entropy approach to unintrusive targeted advertising on the Web , 2000, Comput. Networks.

[77]  Moni Naor,et al.  Rank aggregation methods for the Web , 2001, WWW '01.

[78]  Amanda Spink,et al.  Real life information retrieval: a study of user queries on the Web , 1998, SIGF.

[79]  C. Lee Giles,et al.  Self-Organization and Identification of Web Communities , 2002, Computer.

[80]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.