A Weighted Context Graph Model for Fast Data Leak Detection

Data leakage prevention (DLP) uses a series of techniques to detect and prevent the leakage of sensitive data caused by insider threats. Current detection methods either fail to achieve high accuracy on transformed data or fail to reduce computational complexity. To ensure high detection accuracy while reducing computational complexity, we propose a Weighted Context Graph Model (WCGM) in this paper. The contribution of WCGM is threefold. First, a weighted context graph is proposed to capture the contextual relations of the data, based on which a sub-graph matching method is used to calculate similarity features between the tested data and a pre-defined template. Second, machine learning algorithms are used to classify the tested data based on the similarity features of its context graphs. Third, a privacy-preserving graph masking method is proposed to protect the data privacy of data holders. Extensive simulation results show that the proposed WCGM achieves significant improvements in both running time and accuracy.
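
The first step above, building a weighted graph from the context of terms and scoring a tested document against a template, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the WCGM definition: the sliding-window co-occurrence weighting, the edge-overlap score, the window size, and the names build_context_graph and graph_similarity are all hypothetical. The classification and graph masking steps are not shown.

```python
from collections import Counter


def build_context_graph(text, window=2):
    """Build a weighted, undirected co-occurrence graph (illustrative assumption).

    Nodes are lowercased whitespace tokens. An edge's weight counts how often
    its two terms appear within `window` positions of each other.
    """
    tokens = text.lower().split()
    graph = Counter()
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if term != tokens[j]:
                # Sort the pair so (a, b) and (b, a) map to the same edge.
                edge = tuple(sorted((term, tokens[j])))
                graph[edge] += 1
    return graph


def graph_similarity(tested, template):
    """Fraction of the template's edge weight that also occurs in the tested graph.

    This is a simple weighted edge-overlap score, used here as a stand-in for
    a sub-graph matching similarity feature.
    """
    total = sum(template.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for edges missing from the tested graph.
    shared = sum(min(weight, tested[edge]) for edge, weight in template.items())
    return shared / total


# Hypothetical template of sensitive content and a tested document.
template_graph = build_context_graph("patient record number confidential")
tested_graph = build_context_graph("the patient record number is public")
similarity = graph_similarity(tested_graph, template_graph)
```

In a pipeline like the one the abstract describes, a score of this kind would be one of several similarity features passed to a classifier, which then decides whether the tested data is sensitive.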
