Optimal Web Page Download Scheduling Policies for Green Web Crawling

A web crawler is responsible for discovering and downloading new pages on the Web as well as refreshing previously downloaded pages. During these operations, the crawler issues a large number of HTTP requests to web servers. These requests increase the energy consumption and carbon footprint of the web servers, since computational resources are used while serving the requests. In this work, we introduce the problem of green web crawling, where the objective is to devise a page refresh policy that minimizes the total staleness of pages in the repository of a web crawler, subject to a constraint on the amount of carbon emissions due to the processing on web servers. For the case of one web server and one crawling thread, the optimal policy turns out to be a greedy one. At each iteration, the page to be refreshed is selected based on a metric that considers the page’s staleness, its size, and the greenness of the energy consumed at the web server premises. We then extend the optimal policy to the cases of 1) many servers; 2) multiple threads; and 3) pages with variable freshness requirements. We conduct simulations on a real data set that involves a large web server collection hosting around two billion pages. We present experimental results for the optimal page refresh policy as well as for various heuristics, in an effort to study the effect of different factors on performance.
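
The greedy selection described above can be illustrated with a minimal Python sketch for the single-server, single-thread case. The abstract does not give the actual metric, so the score used here (staleness divided by size times carbon intensity) is only an assumed stand-in, and the names `Page`, `refresh_score`, `greedy_refresh`, `carbon_intensity`, and `carbon_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    staleness: float  # assumed measure of how outdated the stored copy is
    size_kb: float    # assumed proxy for the server work needed to serve the page


def refresh_score(page, carbon_intensity):
    """Assumed metric (not the paper's): staleness removed per unit of carbon emitted."""
    return page.staleness / (page.size_kb * carbon_intensity)


def greedy_refresh(pages, carbon_intensity, carbon_budget):
    """Repeatedly pick the highest-scoring page that still fits in the carbon budget."""
    remaining = list(pages)
    chosen = []
    spent = 0.0
    while remaining:
        best = max(remaining, key=lambda p: refresh_score(p, carbon_intensity))
        remaining.remove(best)
        cost = best.size_kb * carbon_intensity  # assumed emission model
        if spent + cost > carbon_budget:
            continue  # this page does not fit; try the next best one
        chosen.append(best.url)
        spent += cost
    return chosen, spent


pages = [Page("a", 10.0, 5.0), Page("b", 4.0, 1.0), Page("c", 9.0, 10.0)]
chosen, spent = greedy_refresh(pages, carbon_intensity=2.0, carbon_budget=15.0)
```

In this toy setting a small, moderately stale page can outrank a larger, staler one because it costs less carbon to refresh. Extending the sketch to many servers would, under the same assumptions, mean giving each page its own server-specific carbon intensity.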
