Language Specific and Topic Focused Web Crawling

We describe an experiment in automatically collecting large language- and topic-specific corpora with a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and use them to acquire seed URLs from a standard search engine. Starting from these seed URLs, the crawler builds a new, large collection consisting only of documents that satisfy both the language model and the topic model. A manual analysis of the acquired English and German medical corpora reveals the high accuracy of the crawler. However, there are significant differences between the two languages.
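
The crawl loop outlined in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `focused_crawl`, `toy_fetch`, `looks_english`, `looks_medical`, `TOY_WEB` and `MEDICAL_TERMS` are all hypothetical, the in-memory dictionary stands in for real HTTP fetching, and simple keyword checks stand in for the language and topic classifiers. The sketch also assumes that links found on rejected pages are not followed, which the abstract does not state.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A page is modelled as (text, outgoing links). This is an assumption made
# for the sketch; a real crawler would download and parse HTML instead.
Page = Tuple[str, List[str]]


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[Page]],
    is_target_language: Callable[[str], bool],
    is_on_topic: Callable[[str], bool],
    max_pages: int = 1000,
) -> List[str]:
    """Breadth-first crawl that keeps only pages passing both filters."""
    frontier = deque(seeds)
    seen = set(frontier)
    accepted: List[str] = []
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:  # unreachable or unknown URL
            continue
        fetched += 1
        text, links = page
        if not (is_target_language(text) and is_on_topic(text)):
            continue  # assumption: links on rejected pages are not followed
        accepted.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return accepted


# Hypothetical toy data and toy classifiers, for illustration only.
TOY_WEB: Dict[str, Page] = {
    "a": ("the patient received therapy", ["b", "c", "d"]),
    "b": ("der patient erhielt eine therapie", []),
    "c": ("the football match ended", []),
    "d": ("the clinical diagnosis of a disease", []),
}
MEDICAL_TERMS = {"patient", "therapy", "clinical", "diagnosis", "disease"}


def toy_fetch(url: str) -> Optional[Page]:
    return TOY_WEB.get(url)


def looks_english(text: str) -> bool:
    return "the" in text.split()


def looks_medical(text: str) -> bool:
    return any(word in MEDICAL_TERMS for word in text.split())


if __name__ == "__main__":
    print(focused_crawl(["a"], toy_fetch, looks_english, looks_medical))
```

In this toy setting, page "b" fails the language check and page "c" fails the topic check, so only pages that satisfy both models end up in the collection. Passing the two filters as separate callables keeps the crawl loop independent of whichever classification tool is actually used.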
