Designing New Crawling and Indexing Techniques for Web Search Engines

The World Wide Web is growing and changing at an astonishing rate, and Web information systems such as search engines have to keep up with this growth and change. This thesis studies two problems in Web search engines: how a crawler with limited computing resources can effectively crawl the dynamically changing Web and acquire the most up-to-date Web documents, and how a search engine can provide information-object-oriented indexing methods that enable users to retrieve desired information with high accuracy and efficiency. To address the first problem, existing solutions apply sampling techniques at the website level. That is, the crawler chooses a few webpages from each website as samples and then re-downloads all the webpages in the website with the largest number of changed samples. We argue that the website may not be a good granularity for sampling, because the update patterns of webpages within the same website can differ considerably, while webpages with similar update patterns may be distributed across different websites. We design a set of sampling policies with various downloading granularities, taking into account the link structure, the directory structure, and content-based features, including a clustering-based approach. We further extend the clustering-based sampling approach by testing more dynamic features and strategically selecting samples from each cluster. For the second problem, once the crawler has downloaded a set of documents and stored them in the search engine, the search engine should allow users to search accurately for desired information. As more and more digital documents containing various information objects become accessible on the Web, there is a growing demand for Web search systems that give users tools to retrieve documents based on these information objects.
The key challenges are that a search engine needs to (1) improve the accuracy of the returned ranking, (2) enrich the format of search objects, and (3) give informative results to users in different domains. Existing search engines typically maintain large-scale inverted indexes built on the whole local data set. Because these approaches do not focus on information objects in a specific domain, they do not meet the above requirements, and the accuracy of the returned ranking degrades. To address these issues, we propose building indexes on the extracted metadata of various information objects instead of on the whole document, which greatly improves the quality of the final returned ranking. As part of this dissertation, we set up a digital library, ArchSeer, for the domain of archaeology. Archaeologists have search needs that cannot be met by a general-purpose search engine associated with a digital library, such as Google Scholar. The need therefore arises for a digital library like ArchSeer, which allows users to retrieve archaeology literature via domain-specific search engines. For example, archaeologists often publish maps in their documents and need to search using geospatial references. In this dissertation, we show how to design a digital library that performs domain-specific information extraction and indexes the extracted information to give users enhanced search capabilities. The most significant feature of ArchSeer is that it automatically extracts metadata related to different scientific items (e.g., maps and locations) in archaeology papers; we further design ranking algorithms and heuristics to retrieve these items in the system. This dissertation also provides mathematical analyses, extensive simulations, and experiments to evaluate the effectiveness and show the applicability of the proposed techniques.
In addition, it discusses some open issues related to the proposed solutions and suggests directions for designing efficient Web search engines.
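The sampling-based crawling idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `sample_and_select`, the `has_changed` callback, and the `budget` parameter are assumptions introduced here for illustration. A "group" may be a website (the existing website-level approach) or any other granularity, such as a directory or a cluster of pages with similar features. For simplicity, the sketch counts only re-downloads against the budget and ignores the cost of fetching the samples themselves.

```python
import random
from typing import Callable, Dict, List, Optional


def sample_and_select(
    groups: Dict[str, List[str]],
    has_changed: Callable[[str], bool],
    sample_size: int,
    budget: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick pages to re-download using group-level sampling.

    groups       maps a group id (website, directory, cluster, ...) to its URLs.
    has_changed  hypothetical callback: True if the live page differs from
                 the locally stored copy.
    sample_size  number of pages sampled from each group.
    budget       maximum number of pages to re-download (samples not counted).
    """
    rng = rng or random.Random(0)

    # Step 1: sample a few pages per group and count how many have changed.
    changed_counts = {}
    for group_id, urls in groups.items():
        samples = rng.sample(urls, min(sample_size, len(urls)))
        changed_counts[group_id] = sum(1 for url in samples if has_changed(url))

    # Step 2: rank groups by the number of changed samples, largest first.
    ranked = sorted(groups, key=lambda g: changed_counts[g], reverse=True)

    # Step 3: re-download whole groups in ranked order until the budget runs out.
    to_download: List[str] = []
    for group_id in ranked:
        remaining = budget - len(to_download)
        if remaining <= 0:
            break
        to_download.extend(groups[group_id][:remaining])
    return to_download
```

Passing a mapping from website to pages reproduces the website-level policy; passing a mapping from cluster id to pages gives a cluster-level variant, which is the kind of granularity change the abstract argues for.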
