Web robot detection: A probabilistic reasoning approach

In this paper, we introduce a probabilistic modeling approach to the problem of detecting Web robots from Web-server access logs. More specifically, we construct a Bayesian network that automatically classifies access-log sessions as crawler- or human-induced by combining several pieces of evidence shown to characterize crawler and human behavior. Our approach uses an adaptive-threshold technique to extract Web sessions from access logs. We then apply machine learning techniques to determine the parameters of the probabilistic model. Each session is assigned to the class with the maximum posterior probability given the available evidence. We apply our method to real Web-server logs and obtain results that demonstrate the robustness and effectiveness of probabilistic reasoning for crawler detection.
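As a rough illustration of the maximum-posterior classification step, the sketch below uses a naive Bayes model, the simplest Bayesian network structure, in which every evidence variable depends only on the class. This structure is an assumption made for brevity and is not taken from the paper, whose network may differ. The feature names (`robots_txt`, `image_ratio`), their discretized values, and the toy training sessions are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(sessions, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[feature][label][value] -> number of training sessions
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for features, label in zip(sessions, labels):
        for name, value in features.items():
            value_counts[name][label][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, features, alpha=1.0):
    """Return P(class | evidence) for every class, using Laplace smoothing."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            k = len(values_seen[name])  # number of distinct values of this feature
            count = value_counts[name][label][value]
            score += math.log((count + alpha) / (n + alpha * k))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def classify(model, features):
    """Pick the class with the maximum posterior probability."""
    post = posterior(model, features)
    return max(post, key=post.get)


# Hypothetical, hand-made training sessions (not data from the paper).
train_sessions = [
    {"robots_txt": "yes", "image_ratio": "low"},
    {"robots_txt": "yes", "image_ratio": "low"},
    {"robots_txt": "no", "image_ratio": "low"},
    {"robots_txt": "no", "image_ratio": "high"},
    {"robots_txt": "no", "image_ratio": "high"},
    {"robots_txt": "no", "image_ratio": "high"},
    {"robots_txt": "no", "image_ratio": "low"},
]
train_labels = ["crawler"] * 3 + ["human"] * 4

model = train(train_sessions, train_labels)
print(classify(model, {"robots_txt": "yes", "image_ratio": "low"}))
print(classify(model, {"robots_txt": "no", "image_ratio": "high"}))
```

The decision rule is the last step only: whatever the network structure, the session is assigned to the class whose posterior is largest once all observed evidence has been entered.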
