Information retrieval in web crawling: A survey

The World Wide Web (WWW) holds an enormous and growing amount of information. As the internet grows in popularity, finding meaningful information among the billions of resources on the WWW is a challenging task. Information retrieval (IR) provides end users with documents that satisfy their information needs. Search engines are used to extract valuable information from the internet. The web crawler is the principal component of a search engine; it is a script or program that browses the WWW automatically, a process known as web crawling. This paper reviews information retrieval strategies in web crawling, classified into four categories: focused, distributed, incremental, and hidden web crawlers. Finally, a comparative analysis of the various IR strategies is performed on the basis of user-customized parameters.
