Modelling website user behaviors by combining the EM and DBSCAN algorithms

Web logs, when properly analyzed, can provide a wealth of information on the user access patterns of a website. However, finding interesting patterns hidden in low-level log data is non-trivial because log volumes are large and log files are distributed across cluster environments. This paper presents a novel technique that applies the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Expectation Maximization (EM) algorithms iteratively to cluster web user sessions. Each cluster corresponds to one or more web user activities. The distinctive user access pattern of each cluster is identified using frequent pattern mining and sequential pattern mining. Compared with the clustering output of the EM, DBSCAN, and k-means algorithms, this technique shows better accuracy in web session mining and is more effective at identifying cluster changes over time. We demonstrate that the implemented system can identify not only common user behaviors but also cyber-attacks.
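To make the pattern-identification step concrete, the following is a minimal sketch of how frequent page sets and frequent page-to-page transitions might be extracted from the sessions of a single cluster. It is not the paper's implementation: the function names, the support thresholds, and the sample sessions are all hypothetical, and the transition count is a deliberately simplified stand-in for full sequential pattern mining.

```python
from collections import Counter
from itertools import combinations


def frequent_page_sets(sessions, min_support=0.5, max_size=2):
    """Return page sets that occur in at least `min_support` of the sessions.

    Each session counts a page set at most once, regardless of repeat visits.
    """
    counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    threshold = min_support * len(sessions)
    return {items: n / len(sessions) for items, n in counts.items() if n >= threshold}


def frequent_transitions(sessions, min_support=0.5):
    """Return consecutive page pairs (a -> b) seen in at least `min_support` of the sessions."""
    counts = Counter()
    for pages in sessions:
        # A set, so that each transition is counted once per session.
        counts.update(set(zip(pages, pages[1:])))
    threshold = min_support * len(sessions)
    return {pair: n / len(sessions) for pair, n in counts.items() if n >= threshold}


# Hypothetical sessions assigned to one cluster (page names are made up).
cluster_sessions = [
    ["/home", "/search", "/item", "/cart"],
    ["/home", "/search", "/item"],
    ["/home", "/item", "/cart"],
    ["/search", "/item", "/cart"],
]

page_sets = frequent_page_sets(cluster_sessions, min_support=0.75)
transitions = frequent_transitions(cluster_sessions, min_support=0.75)
```

In this toy cluster, `/item` appears in every session, and the transitions `/search` to `/item` and `/item` to `/cart` each occur in three of the four sessions, which is the kind of signature that could be used to label a cluster with a user activity.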
