An Approach to Detect the Internet Water Army via Dirichlet Process Mixture Model Based GSP Algorithm

The Internet Water Army (IWA) poses a serious threat to cyber security, and accurately recognizing IWA accounts has become a challenging research problem. Most existing work uses behavioral analysis to distinguish IWA from non-IWA users. These approaches fall into two main categories: direct calculation methods and training-based learning methods. Direct calculation methods rely mainly on a crawler and build a multidimensional feature vector to detect IWA accounts. However, they do not consider time-ordered behavior rules and judge user behavior by the feature vector alone, so the results are not very accurate and the recognition rate leaves room for improvement. The second category relies mainly on clustering. However, clustering approaches require the number of clusters to be fixed in advance, and an inappropriate number of clusters leads directly to overfitting or underfitting of the model. In this paper we propose a sequential pattern approach based on the Dirichlet process mixture model (DPMM) for IWA identification. First, we analyze the behavior of potential IWA users and obtain a feature vector of user behavior. Second, we use the DPMM to obtain effective and accurate clustering results. Finally, we use a sequential pattern mining algorithm to detect water army accounts. Clustering results on datasets from the Tianya forum show very good performance.
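To make the final step more concrete, the following is a minimal sketch of level-wise sequential pattern mining in the spirit of GSP, the algorithm named in the title. It is an illustration under stated assumptions, not the authors' implementation: it assumes each user's behavior has already been reduced to an ordered list of integer cluster labels (such as a DPMM clustering step might assign), it treats every sequence element as a single item, and it omits the time constraints of full GSP. The names `is_subsequence`, `support`, `frequent_patterns`, and the toy data `users` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def is_subsequence(pattern: Sequence[int], sequence: Sequence[int]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def support(pattern: Pattern, sequences: List[List[int]]) -> int:
    """Count how many sequences contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)


def frequent_patterns(sequences: List[List[int]], min_support: int) -> Dict[Pattern, int]:
    """Simplified GSP-style miner: grow candidates one item per level."""
    items = sorted({x for s in sequences for x in s})
    frequent: Dict[Pattern, int] = {}
    level: List[Pattern] = [(x,) for x in items]
    while level:
        counted = {p: support(p, sequences) for p in level}
        kept = {p: c for p, c in counted.items() if c >= min_support}
        frequent.update(kept)
        # Join step: p and q combine when p without its first item
        # equals q without its last item.
        level = sorted({p + (q[-1],) for p in kept for q in kept if p[1:] == q[:-1]})
    return frequent


# Hypothetical toy data: one list of cluster labels per user account.
users = [[0, 1, 2, 1], [0, 2, 1], [0, 1, 2], [2, 0]]
patterns = frequent_patterns(users, min_support=3)
print(patterns)
```

In a pipeline like the one described above, patterns that are frequent among known or suspected IWA accounts could then be matched against other accounts; how the authors score or threshold such matches is not stated here.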
