An effective approach to enhancing a focused crawler using Google

In this paper, we share our experience in augmenting the focused crawler of our vertical search engine for academic slides. The goal of the focused crawler was to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. To remedy these problems, we propose a combined approach in which our query generator uses the indexing information maintained by a general web search engine such as Google to generate a list of target URLs, which is then processed by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient, and potentially more effective, to systematically retrieve the desired information from Google than to redo the crawl from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. We verified the capability of SlideCrawler on the world's top 500 universities, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.
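The query generator and URL extractor stages described above can be illustrated with a minimal sketch. It is not SlideCrawler's actual implementation: the abstract does not specify the query format, so the use of the `site:` and `filetype:` search operators, the function names `generate_queries` and `extract_slide_urls`, and the `ppt`/`pptx` extension list are all assumptions made for illustration. The sketch performs no network access; it only builds query strings and filters links from HTML that has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed target file types; the abstract only says "Microsoft PowerPoint files".
SLIDE_EXTENSIONS = ("ppt", "pptx")


def generate_queries(domains, extensions=SLIDE_EXTENSIONS):
    """Build one search query string per (domain, extension) pair."""
    return [f"site:{d} filetype:{e}" for d in domains for e in extensions]


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_slide_urls(html, domain, extensions=SLIDE_EXTENSIONS):
    """Return de-duplicated slide URLs hosted on `domain` or its subdomains."""
    collector = _LinkCollector()
    collector.feed(html)
    suffixes = tuple("." + e for e in extensions)
    seen, urls = set(), []
    for link in collector.links:
        parsed = urlparse(link)
        host = parsed.hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        is_slide = parsed.path.lower().endswith(suffixes)
        if on_domain and is_slide and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls
```

Under these assumptions, `generate_queries(["example.edu"])` yields one query per extension, and `extract_slide_urls` keeps only links whose host matches the queried institution and whose path ends in a slide extension. The resulting list would then be handed to a file downloader, which is omitted here.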
