BINGO!: bookmark-induced gathering of information

Focused (thematic) crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It involves the automatic classification of visited documents into a user- or community-specific topic hierarchy (ontology). The quality of training data for the classifier is the most critical issue and a potential bottleneck for the effectiveness and scale of a focused crawler. This paper presents the BINGO! approach to focused crawling, which aims to overcome the limitations of the initial training data. To this end, BINGO! identifies characteristic "archetypes" among the crawled and positively classified documents of a topic and uses them to periodically retrain the classifier; in this way the crawler is dynamically adapted based on the most significant documents seen so far. Two kinds of archetypes are considered: good authorities, as determined by Kleinberg's (1999) link analysis algorithm, and documents that have been automatically classified with high confidence by a linear SVM classifier. Our approach is fully implemented in the BINGO! system, and our experiments indicate that the dynamic enhancement of training data based on archetypes extends the "knowledge base" of the classifier by a substantial margin without loss of classification accuracy.
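The archetype selection described above can be illustrated with a minimal Python sketch. It is not the BINGO! implementation: the function names, the parameters `k_auth` and `k_conf`, and the iteration count are hypothetical, and the classifier confidence is assumed to be a signed score (for example, the distance from a linear SVM hyperplane) supplied by the caller. Authority scores are computed with a plain power iteration in the style of Kleinberg's algorithm.

```python
from math import sqrt


def hits_authorities(links, iterations=50):
    """Return authority scores for a link graph.

    links maps each page to the list of pages it links to.
    The fixed iteration count is an assumption for illustration.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the authority scores of all pages linked to.
        new_hub = {n: sum(new_auth[d] for d in links.get(n, ())) for n in nodes}
        # Normalize both vectors to unit Euclidean length.
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth


def select_archetypes(links, confidence, k_auth=2, k_conf=2):
    """Pick archetypes among positively classified documents.

    confidence maps each classified document to a signed score;
    a score above zero is assumed to mean "accepted for the topic".
    """
    auth = hits_authorities(links)
    candidates = [d for d in confidence if confidence[d] > 0]
    # Archetype kind 1: the best authorities among the candidates.
    by_auth = sorted(candidates, key=lambda d: auth.get(d, 0.0), reverse=True)[:k_auth]
    # Archetype kind 2: the most confidently classified candidates.
    by_conf = sorted(candidates, key=lambda d: confidence[d], reverse=True)[:k_conf]
    return set(by_auth) | set(by_conf)


# Toy crawl: three pages link to "c", one of them also links to "e".
links = {"a": ["c"], "b": ["c"], "d": ["c", "e"]}
confidence = {"a": 0.2, "c": 1.5, "e": 0.9, "d": -0.3}
archetypes = select_archetypes(links, confidence, k_auth=1, k_conf=2)
```

In a retraining loop, the returned documents would be added to the topic's training set before the classifier is rebuilt; how often this happens and how many archetypes are admitted per round are not specified in the abstract.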

[1]  Hector Garcia-Molina,et al.  Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[2]  Marco Gori,et al.  Focused Crawling Using Context Graphs , 2000, VLDB.

[3]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[4]  Susan T. Dumais,et al.  Bringing order to the Web: automatically categorizing search results , 2000, CHI.

[5]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997.

[6]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[7]  Gerhard Weikum,et al.  Adding Relevance to XML , 2000, WebDB.

[8]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[9]  Prabhakar Raghavan,et al.  Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies , 1998, The VLDB Journal.

[10]  Thorsten Joachims,et al.  A statistical learning model of text classification for support vector machines , 2001, SIGIR '01.

[11]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[12]  Susan T. Dumais,et al.  Hierarchical classification of Web content , 2000, SIGIR '00.

[13]  Torsten Suel,et al.  Design and implementation of a high-performance distributed Web crawler , 2002, Proceedings 18th International Conference on Data Engineering.

[14]  Thorsten Joachims,et al.  A Statistical Learning Model of Text Classification for Support Vector Machines , 2001, SIGIR '01.

[15]  Marc Najork,et al.  Mercator: A scalable, extensible Web crawler , 1999, World Wide Web.

[16]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[17]  Martin van den Berg,et al.  Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery , 1999, Comput. Networks.

[18]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[19]  Alberto O. Mendelzon,et al.  What is this page known for? Computing Web page reputations , 2000, Comput. Networks.

[20]  Daphne Koller,et al.  Hierarchically Classifying Documents Using Very Few Words , 1997, ICML.

[21]  Eli Upfal,et al.  The Web as a graph , 2000, PODS.