Machine Learning Based CloudBot Detection Using Multi-Layer Traffic Statistics

With the rapid development of e-commerce services and online transactions, speculators and hackers in the underground economy increasingly use advanced web robots to perform click fraud, register fake accounts, and commit other kinds of fraud, seriously harming business profits and the fairness of online activities. There is solid evidence that the vast majority of such malicious bot traffic comes from data centers. We refer to a malicious bot deployed on data center hosts as a CloudBot. Detecting and blocking CloudBots effectively has become an urgent practical problem, yet little research on it has been published. To this end, we propose a traffic-based, quasi-real-time method for CloudBot detection using machine learning, which exploits a new sample partitioning approach together with innovative multi-layer features that reveal the essential differences between CloudBot and human traffic. Our method achieves 93.4% precision in the experiment and performs well on a real-world dataset, showing that it is effective at detecting unknown CloudBots and at combating the concept drift that arises over time. In addition, the approach is privacy-preserving because it does not use any specific application-layer information. We believe our work can benefit the security and fairness of the network economy in practice.
