News Page Discovery Policy for Instant Crawlers

Many news pages are published on the internet every day, and they have strict freshness requirements: instant crawlers should download them immediately, or they soon become outdated. In the past, instant crawlers downloaded pages only from a manually generated list of news websites. Because news websites do not publish news pages exclusively, this wastes bandwidth on non-news pages. In this paper, we propose a novel approach to discovering news pages. The approach consists of seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that our approach outperforms the traditional approach in both precision and recall.
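The two evaluation metrics named above can be illustrated with a minimal sketch. It uses the standard definitions of precision and recall over sets of URLs. The function name `precision_recall` and the example URLs are hypothetical, chosen for illustration only, and this is not the paper's actual evaluation procedure.

```python
def precision_recall(discovered, actual_news):
    """Return (precision, recall) for a set of discovered URLs.

    discovered:  URLs the crawler chose to download (hypothetical input).
    actual_news: URLs known to be news pages (hypothetical ground truth).
    """
    discovered = set(discovered)
    actual_news = set(actual_news)
    hits = discovered & actual_news  # news pages that were actually discovered

    # Precision: share of downloaded pages that are news (bandwidth not wasted).
    precision = len(hits) / len(discovered) if discovered else 0.0
    # Recall: share of all news pages that the crawler found.
    recall = len(hits) / len(actual_news) if actual_news else 0.0
    return precision, recall


# Illustrative data only: four downloaded URLs, three true news pages.
discovered = {
    "http://news.example.com/a",
    "http://news.example.com/b",
    "http://news.example.com/about",
    "http://news.example.com/ads",
}
actual_news = {
    "http://news.example.com/a",
    "http://news.example.com/b",
    "http://news.example.com/e",
}

precision, recall = precision_recall(discovered, actual_news)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, half of the downloaded pages are news, and two of the three news pages are found. A crawler that downloads everything from a fixed site list tends to lose precision, which corresponds to the wasted bandwidth described above.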
