A novel defense mechanism against web crawlers intrusion

Web robots, also known as crawlers or spiders, are used by search engines, hackers, and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this paper, a novel method for identifying web crawlers is proposed to prevent unwanted crawlers from accessing websites. The method uses a five-factor identification process to detect unwanted crawlers. An experiment is performed with repeated measures on two groups, each containing the same web pages: a treatment group that uses the proposed identification process and a control group that does not. This work reports the pretest and posttest results, along with a systematic evaluation of the web pages with the proposed identification technique versus those without it. The outputs of logistic regression analysis for both the treatment and control groups are provided to evaluate the hypotheses and to answer the research questions. The main goal of this work is to address the challenge of identifying and preventing unwanted web crawlers by proposing a novel defense mechanism built on this identification process.
