Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers

The backbone of every search engine is its set of web crawlers, which revisit all indexed web pages and, where a page has changed, update the search index with a fresh copy. By keeping the indexes fresh and up to date, the crawling process allows the search engine to return optimum search results. Ideally, this requires an "ideal scheduler" that crawls each web page immediately after a change occurs. Creating an optimum scheduler is possible when the web crawler has information about how often a particular page changes. This paper discusses a novel methodology to determine the change frequency of a web page using machine learning and server scheduling techniques. The methodology has been evaluated on more than 3,000 web pages with various change patterns. The results indicate how Information Access (IA) and Performance Gain (PG) are balanced out to zero in order to create an optimum crawling schedule for search engine indexing.
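To make the link between change frequency and scheduling concrete, the following is a minimal sketch, not the method evaluated in this paper. It assumes a plain mean-gap estimator in place of the Random Forest classifier named in the title, and the names `estimate_change_interval` and `build_schedule`, the timestamp unit (hours), and the sample URLs are all hypothetical.

```python
import heapq


def estimate_change_interval(change_times):
    """Return the mean gap between consecutive observed changes.

    Assumption: a simple average stands in for a learned change-frequency model.
    """
    if len(change_times) < 2:
        raise ValueError("need at least two observed changes")
    gaps = [later - earlier for earlier, later in zip(change_times, change_times[1:])]
    return sum(gaps) / len(gaps)


def build_schedule(pages, horizon):
    """Plan crawl events up to `horizon`, ordered by time.

    `pages` maps a URL to the sorted timestamps at which changes were observed.
    Each page is crawled one estimated interval after its last known change,
    then repeatedly at that same interval.
    """
    heap = []
    for url, change_times in pages.items():
        interval = estimate_change_interval(change_times)
        heapq.heappush(heap, (change_times[-1] + interval, url, interval))

    schedule = []
    while heap:
        crawl_time, url, interval = heapq.heappop(heap)
        if crawl_time > horizon:
            break  # the heap is time-ordered, so every remaining event is later
        schedule.append((crawl_time, url))
        heapq.heappush(heap, (crawl_time + interval, url, interval))
    return schedule


if __name__ == "__main__":
    observed = {
        "example.org/news": [0, 2, 4],  # changes roughly every 2 hours
        "example.org/about": [0, 3, 6],  # changes roughly every 3 hours
    }
    for crawl_time, url in build_schedule(observed, horizon=10):
        print(f"t={crawl_time:g}h crawl {url}")
```

The priority queue keeps the next due page at the front, so pages that change more often are simply visited more often; a learned model would only replace the interval estimate.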
