A Thread-wise Strategy for Incremental Crawling of Web Forums

This paper studies the incremental crawling of web forums, a fundamental yet challenging step in many web applications. Traditional approaches focus mainly on scheduling revisits for each individual page. However, simply assigning different weights to individual pages is usually inefficient for forum sites, whose characteristics differ from those of general websites. Instead of treating each page independently, we propose a thread-wise strategy that uses thread-level statistics, for example the number and frequency of replies, to estimate the activity trend of each thread. To obtain these statistics, we develop a simple yet robust approach to extracting the timestamp of each post in a discussion thread. We also employ a regression model to predict the time of the next post in each thread. Based on this model, we develop an efficient crawler that fetches newly generated content 2.6 times faster than state-of-the-art methods while still ensuring a high coverage ratio. Experimental results on various forums show encouraging performance in terms of coverage, bandwidth utilization, and age.
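The idea of predicting the next post time per thread and revisiting threads in that order can be illustrated with a minimal sketch. It is not the paper's model: the abstract does not specify the regression features, so the sketch assumes a one-variable least-squares fit that predicts the next inter-post gap from the previous one. The function names (`fit_line`, `predict_next_post`, `revisit_order`) and the fallback to the mean gap for short threads are hypothetical choices made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All x values are identical: fall back to a constant prediction.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def predict_next_post(timestamps):
    """Predict the time of the next post from sorted post timestamps.

    Assumption (not from the paper): the next gap is regressed on the
    previous gap. Threads with fewer than three gaps use the mean gap.
    """
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    if len(gaps) < 3:
        fallback = mean(gaps) if gaps else 0.0
        return timestamps[-1] + fallback
    a, b = fit_line(gaps[:-1], gaps[1:])
    next_gap = max(0.0, a + b * gaps[-1])  # a gap cannot be negative
    return timestamps[-1] + next_gap


def revisit_order(threads):
    """Order thread ids by predicted next-post time, earliest first."""
    return sorted(threads, key=lambda tid: predict_next_post(threads[tid]))


if __name__ == "__main__":
    # Hypothetical post timestamps in seconds for two threads.
    threads = {
        "slowing": [0, 10, 30, 70, 150],  # gaps double each time
        "steady": [0, 5, 10, 15, 20],     # constant gap of 5
    }
    for tid in revisit_order(threads):
        print(tid, predict_next_post(threads[tid]))
```

A crawler built on this sketch would revisit the "steady" thread first, because its predicted next post comes sooner. A fuller model would add the other thread-level statistics the abstract mentions, such as the total number of replies.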
