Local versus global link information in the Web

Information derived from the cross-references among the documents in a hyperlinked environment, usually referred to as link information, is considered important since it can be used to effectively improve document retrieval. Depending on the retrieval strategy, link information can be local or global. Local link information is derived from the set of documents returned as answers to the current user query. Global link information is derived from all the documents in the collection. In this work, we investigate how the use of local link information compares to the use of global link information. For the comparison, we run a series of experiments using a large document collection extracted from the Web. For our reference collection, the results indicate that the use of local link information improves precision by 74%. When global link information is used, precision improves by 35%. However, when only the first 10 documents in the ranking are considered, the average gain in precision obtained with the use of global link information is higher than the gain obtained with the use of local link information. This is an interesting result since it provides insight and justification for the use of global link information in major Web search engines, where users are mostly interested in the first 10 answers. Furthermore, global information can be computed in the background, which allows speeding up query processing.
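
The distinction between local and global link information can be illustrated with a minimal sketch. It is not the method evaluated in this work: it uses plain in-link counts as a stand-in link score, and it assumes that "local" means keeping only links whose source and target both belong to the answer set. The function name `in_degree`, the `allowed` parameter, and the toy link data are hypothetical.

```python
from collections import Counter


def in_degree(links, allowed=None):
    """Count inbound links per document.

    links:   iterable of (source, target) document-id pairs.
    allowed: optional set of document ids; when given, a link is counted
             only if both endpoints are in the set (assumed meaning of
             "local"). When None, every link counts ("global").
    """
    counts = Counter()
    for src, dst in links:
        if src == dst:
            continue  # ignore self-links
        if allowed is not None and (src not in allowed or dst not in allowed):
            continue
        counts[dst] += 1
    return counts


# Toy collection (hypothetical data): five documents and their links.
links = [
    ("a", "b"), ("c", "b"), ("d", "b"),
    ("a", "c"), ("d", "c"), ("e", "c"),
    ("b", "a"),
]

# Documents returned as answers to the current query (hypothetical).
answers = {"a", "b", "c"}

# Global: computed once over the whole collection, independent of the query.
global_scores = in_degree(links)

# Local: computed at query time over the answer set only.
local_scores = in_degree(links, allowed=answers)
```

In this toy data, documents `b` and `c` tie on the global score, while the local score separates them because most of the links into `c` come from documents outside the answer set. The global scores do not depend on the query, which is why they can be precomputed in the background.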
