Effective web crawling

The key factors in the success of the World Wide Web are its large size and the lack of centralized control over its contents. Both factors are also the main sources of difficulty in locating information. The Web is a setting in which traditional Information Retrieval methods are challenged: given its volume and its rate of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce compared with the rest of the content.
