Detection of change frequency in web pages to optimize server-based scheduling

The Internet has become vast and dynamic, with an ever-increasing number of web pages. These pages change as content is added to them. Change detection and notification systems have made it simpler to keep track of such changes. However, most of these systems rely on predefined crawling schedules with static time intervals. A static schedule is inefficient when a page is crawled more often than it changes, because temporal and computational resources are spent on crawls that find no relevant changes. Conversely, when a page is crawled less often than it changes, important changes may be missed and notifications to subscribed users may be delayed. This paper proposes a methodology for detecting the frequency of change in web pages in order to optimize the server-side scheduling of change detection and notification systems. The proposed method is based on a dynamic detection process in which the crawling schedule is adjusted according to the detected change frequency, yielding a more efficient server-based scheduler for detecting changes in web pages.
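The idea of adjusting a crawling schedule to observed changes can be illustrated with a minimal sketch. This is not the method proposed in the paper, whose details are not given here. It assumes a simple multiplicative rule: the interval is halved when a crawl finds a change and lengthened by a factor of 1.5 when it does not. The class name `PageSchedule`, its fields, the bounds, and both factors are hypothetical choices made for illustration, and a change is assumed to mean any difference in a SHA-256 digest of the page content.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSchedule:
    """Hypothetical per-page crawl schedule; all defaults are illustrative."""
    interval: float = 3600.0       # seconds until the next crawl
    min_interval: float = 300.0    # never crawl more often than this
    max_interval: float = 86400.0  # never crawl less often than this
    last_digest: Optional[str] = None

    def record_crawl(self, content: str) -> bool:
        """Update the interval from one crawl result; return True if changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        is_first_crawl = self.last_digest is None
        changed = (not is_first_crawl) and digest != self.last_digest
        if not is_first_crawl:
            if changed:
                # Assumed rule: a detected change halves the interval.
                self.interval = max(self.min_interval, self.interval / 2)
            else:
                # Assumed rule: no change lengthens the interval by 1.5x.
                self.interval = min(self.max_interval, self.interval * 1.5)
        self.last_digest = digest
        return changed


if __name__ == "__main__":
    schedule = PageSchedule()
    for snapshot in ["v1", "v1", "v2", "v2"]:
        changed = schedule.record_crawl(snapshot)
        print(f"changed={changed} next crawl in {schedule.interval:.0f}s")
```

Under these assumptions, a page that keeps changing converges toward `min_interval`, while a static page drifts toward `max_interval`. A digest comparison treats every byte-level difference as a change, so a real system would likely need to filter out irrelevant differences such as timestamps or advertisements.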