Information retrieval on the web

Introduction

How do we find information on the Web? Although information on the Web is distributed and decentralized, the Web can be viewed as a single, virtual document collection. In that regard, the fundamental questions and approaches of traditional information retrieval (IR) research (e.g., term weighting, query expansion) are likely to be relevant in Web document retrieval.

Findings from traditional IR research, however, may not always be applicable in a Web setting. The Web document collection, massive in size and diverse in content, format, purpose, and quality, challenges the validity of previous research findings that are based on relatively small and homogeneous test collections. Moreover, some traditional IR approaches, although applicable in theory, may be impossible or impractical to implement in a Web setting. For instance, the size, distribution, and dynamic nature of Web information make it extremely difficult to construct a complete and up-to-date data representation of the kind required for a model IR system.

To further complicate matters, information seeking on the Web is diverse in character and unpredictable in nature. Web searchers come from all walks of life and are motivated by many kinds of information needs. Their wide range of experience, knowledge, motivation, and purpose means that searchers express diverse types of information needs in a wide variety of ways, with differing criteria for satisfying those needs. Conventional evaluation measures, such as precision and recall, may no longer be appropriate for Web IR, where a representative test collection is all but impossible to construct.

Finding information on the Web creates many new challenges for, and exacerbates some old problems in, IR research. At the same time, the Web is rich in new types of information not present in most IR test collections. Hyperlinks, usage statistics, document markup tags, and collections of topic hierarchies such as Yahoo! (http://www.yahoo.com) present an opportunity to leverage Web-specific document characteristics in novel ways that go beyond the term-based retrieval framework of traditional IR. Consequently, researchers in Web IR have reexamined the findings from traditional IR research to discover which conventional
