Timely crawling of high-quality ephemeral new content

In this paper, we study the problem of finding and crawling \textit{ephemeral} new pages in a timely manner, i.e., pages for which user traffic grows quickly right after they appear but lasts only for several days (e.g., news, blog, and forum posts). Traditional crawling policies give no particular priority to such pages and may therefore crawl them too late, or even crawl content that is already obsolete. We therefore propose a new metric designed for this task, which takes into account the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and suggest a method for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties is dividing resources between these two activities efficiently. We find an adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler insight into current user demand. Finally, we demonstrate the effectiveness of our approach experimentally on real-world data.
