Paper Information - A Probabilistic Reasoning Approach for Discovering Web Crawler Sessions


In this paper we introduce a probabilistic-reasoning approach to distinguish Web robots (crawlers) from human visitors of Web sites. Our approach employs a Naive Bayes network to classify the HTTP sessions of a Web-server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic. The parameters of the Bayesian network are determined with machine learning techniques, and each session is assigned to the class with the maximum posterior probability, given the available evidence. Our method is applied to real Web logs and provides a classification accuracy of 95%. The high accuracy with which our system detects crawler sessions demonstrates the robustness and effectiveness of the proposed methodology.
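
The classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names (`robots_txt`, `image_ratio`), the toy sessions, and the Laplace smoothing constant `alpha` are assumptions made for illustration only. The sketch estimates class priors and per-feature conditional probabilities from labeled sessions, then assigns a new session to the class with the maximum posterior probability.

```python
import math
from collections import Counter, defaultdict


def train(sessions, labels, alpha=1.0):
    """Estimate Naive Bayes counts from labeled sessions.

    sessions: list of dicts mapping a feature name to a discrete value.
    labels:   list of class labels, e.g. "crawler" or "human".
    alpha:    Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)      # feature -> values seen in training
    for feats, label in zip(sessions, labels):
        for name, value in feats.items():
            feature_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return {
        "class_counts": class_counts,
        "feature_counts": feature_counts,
        "feature_values": feature_values,
        "alpha": alpha,
    }


def posterior(model, feats):
    """Return P(class | evidence) for every class, computed in log space."""
    alpha = model["alpha"]
    total = sum(model["class_counts"].values())
    log_scores = {}
    for label, n in model["class_counts"].items():
        score = math.log(n / total)  # log prior
        for name, value in feats.items():
            count = model["feature_counts"][(label, name)][value]
            k = len(model["feature_values"][name])
            # Smoothed conditional probability P(value | label).
            score += math.log((count + alpha) / (n + alpha * k))
        log_scores[label] = score
    # Normalize so the posteriors sum to 1.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def classify(model, feats):
    """Pick the class with the maximum posterior probability."""
    post = posterior(model, feats)
    return max(post, key=post.get)


# Hypothetical toy training data; real features would be extracted
# from the sessions of a Web-server access log.
toy_sessions = [
    {"robots_txt": True, "image_ratio": "low"},
    {"robots_txt": True, "image_ratio": "low"},
    {"robots_txt": False, "image_ratio": "high"},
    {"robots_txt": False, "image_ratio": "high"},
]
toy_labels = ["crawler", "crawler", "human", "human"]
toy_model = train(toy_sessions, toy_labels)
```

Calling `classify(toy_model, {"robots_txt": True, "image_ratio": "low"})` returns the label whose posterior is largest for that session. Working in log space avoids numerical underflow when many pieces of evidence are combined.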

Marios D. Dikaiakos | Athena Stassopoulou
