An Overview of Web Robots Detection Techniques

Web robots, or web crawlers, have become the major source of web traffic. While some robots, such as search engine crawlers, are well behaved, others can carry out DDoS attacks that pose a serious threat to websites. Detecting web robots effectively benefits not only network traffic cleaning but also the cybersecurity of IoT-enabled systems and services. To capture the state of the art in web robot detection, this paper reviews the past decade of research on web robot/crawler detection techniques, compares their performance, and identifies the challenges of each technique, thus providing researchers with a reference for developing web robot detection in real applications. To protect web content from malicious web robots, researchers have investigated various approaches, which can be classified into three themes: offline web log analysis, honeypots, and online robot detection. We conclude that offline web log analysis methods achieve quite high accuracy but are time-consuming compared with online detection methods. Honeypots, a computer security mechanism, can be used to engage and deceive hackers and to identify malicious activities performed over the Internet, but they may block legitimate robots. The review shows that a hybrid method is better than an individual classifier and that the performance of online web robot detection needs to be improved. In addition, different types of features can play different roles in different machine learning models; feature selection is therefore important for web robot/crawler detection.
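As a rough illustration of two of the themes named above (offline web log analysis and honeypots), the following minimal Python sketch groups already-parsed log records into sessions and computes a few per-session features. Everything in it is an assumption made for illustration and is not taken from the reviewed papers: the `Request` record, the `HONEYPOT_PATH` trap URL, the chosen features, and the threshold in `looks_like_robot` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical hidden URL that no visible link points to (a honeypot trap).
HONEYPOT_PATH = "/hidden-trap/"
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg")


@dataclass
class Request:
    """One already-parsed web log record (hypothetical, simplified schema)."""
    ip: str
    user_agent: str
    method: str
    path: str


def session_features(requests):
    """Group requests by (ip, user_agent) and compute per-session features."""
    sessions = defaultdict(list)
    for r in requests:
        sessions[(r.ip, r.user_agent)].append(r)

    features = {}
    for key, reqs in sessions.items():
        n = len(reqs)
        features[key] = {
            "requests": n,
            "robots_txt": any(r.path == "/robots.txt" for r in reqs),
            "honeypot_hit": any(r.path.startswith(HONEYPOT_PATH) for r in reqs),
            "head_ratio": sum(r.method == "HEAD" for r in reqs) / n,
            "image_ratio": sum(
                r.path.lower().endswith(IMAGE_SUFFIXES) for r in reqs
            ) / n,
        }
    return features


def looks_like_robot(f):
    """Illustrative rule only; the 0.5 threshold is an untuned assumption."""
    return f["robots_txt"] or f["honeypot_hit"] or f["head_ratio"] > 0.5


# Made-up sample log used only to exercise the functions above.
log = [
    Request("10.0.0.1", "ExampleBot/1.0", "GET", "/robots.txt"),
    Request("10.0.0.1", "ExampleBot/1.0", "HEAD", "/index.html"),
    Request("10.0.0.2", "Mozilla/5.0", "GET", "/index.html"),
    Request("10.0.0.2", "Mozilla/5.0", "GET", "/logo.png"),
    Request("10.0.0.3", "Mozilla/5.0", "GET", "/hidden-trap/page"),
]

features = session_features(log)
labels = {key: looks_like_robot(f) for key, f in features.items()}
```

A rule like this flags robots in general, including well-behaved ones that request `/robots.txt`, which mirrors the caveat above that honeypot-style traps may catch legitimate robots. In practice, feature dictionaries of this kind would more likely be fed to a trained classifier than to a fixed rule.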
