From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation

Publisher Summary: Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics, with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populated with relevant, high-quality documents for expert Web search. The BINGO! focused crawler implements an approach that aims to overcome the limitations of the initial training data. Among the crawled documents that were positively classified for a topic, BINGO! identifies characteristic archetypes and uses them to retrain the classifier periodically. In this way the crawler adapts dynamically to the most significant documents seen so far.
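
The crawl-classify-retrain cycle described above can be illustrated with a minimal sketch for a single topic. Everything in it is an assumption made for illustration, not the BINGO! implementation: the "Web" is a small in-memory dictionary, the classifier is a term-frequency centroid compared by cosine similarity (a stand-in for whatever classifier the system actually uses), archetypes are simply the highest-scoring accepted documents, and the names `focused_crawl`, `threshold`, `retrain_every`, and `n_archetypes` are hypothetical.

```python
import heapq
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts for a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(docs):
    """Toy 'classifier': the summed term counts of the training documents."""
    centroid = Counter()
    for doc in docs:
        centroid.update(vectorize(doc))
    return centroid


def focused_crawl(web, seeds, training_docs, threshold=0.3,
                  retrain_every=2, n_archetypes=1, max_pages=50):
    """Crawl `web` (url -> (text, links)), following links only from on-topic pages."""
    model = train(training_docs)
    frontier = [(-1.0, url) for url in seeds]  # max-priority via negated score
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []  # (score, url, text) of positively classified pages
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = cosine(vectorize(text), model)
        if score < threshold:
            continue  # off-topic: do not expand its links
        accepted.append((score, url, text))
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
        if len(accepted) % retrain_every == 0:
            # Assumed archetype rule: the best-scoring accepted documents so far.
            archetypes = sorted(accepted, reverse=True)[:n_archetypes]
            model = train(training_docs + [t for _, _, t in archetypes])
    return [url for _, url, _ in accepted]


# Hypothetical five-page web; "c" is off-topic, so "e" behind it is never reached.
web = {
    "a": ("svm text classification kernel", ["b", "c"]),
    "b": ("text classification with support vector machines", ["d"]),
    "c": ("cheap flights and hotel deals", ["e"]),
    "d": ("kernel methods for classification of web text", []),
    "e": ("svm classification tutorial", []),
}
result = focused_crawl(web, ["a"], ["svm text classification"])
```

In this toy run the model is retrained once two pages have been accepted, so page "a" joins the training data and its vocabulary contributes to the score of page "d". The sketch makes a hard-focus choice, dropping the links of rejected pages entirely; a softer policy would enqueue them with low priority instead.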
