Detecting Data Semantic: A Data Leakage Prevention Approach

Data leakage prevention systems (DLPSs) are increasingly being implemented by organizations. Unlike standard security mechanisms such as firewalls and intrusion detection systems, DLPSs are dedicated systems that protect data in use, at rest, and in transit. DLPSs analyze the content and surrounding context of confidential data to detect and prevent unauthorized access to it. DLPSs that use content analysis rely largely on data fingerprinting, regular expressions, and statistical analysis to detect data leaks. Because data is susceptible to change, data fingerprinting and regular expressions fall short in detecting the semantics of confidential data that has evolved. Statistical analysis, however, can handle data that is fuzzy in nature or otherwise varied. Thus, DLPSs with statistical analysis capabilities can approximate the presence of data semantics. In this paper, a statistical data leakage prevention (DLP) model is presented that classifies data on the basis of semantics. This study contributes to the DLP field by using statistical analysis to detect evolved confidential data. The approach uses the well-known information retrieval weighting function Term Frequency-Inverse Document Frequency (TF-IDF) to classify documents under certain topics. Singular Value Decomposition (SVD) was also used to visualize the classification results. The results showed that the proposed statistical DLP approach could correctly classify documents even in cases of extreme modification. It also achieved high precision and recall.
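
The abstract does not spell out the classification pipeline, so the following is only a minimal sketch of how TF-IDF weighting could assign a modified document to a topic. It makes several assumptions not taken from the paper: a smoothed IDF formula, per-topic summed TF-IDF vectors, cosine similarity as the decision rule, and the hypothetical helper names `tokenize`, `vectorize`, `cosine`, `train`, and `classify`. The toy documents are invented for illustration, and the SVD visualization step is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (an illustrative simplification).
    return re.findall(r"[a-z]+", text.lower())


def vectorize(tokens, idf):
    # TF-IDF vector as a sparse dict; terms unseen in training are ignored.
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (count / total) * idf[t] for t, count in tf.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    # labeled_docs: list of (topic_label, text) pairs.
    tokenized = [(label, tokenize(text)) for label, text in labeled_docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for _, toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumption; other variants exist).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    # Sum the TF-IDF vectors per topic; cosine similarity ignores the scale.
    topic_vectors = {}
    for label, toks in tokenized:
        acc = topic_vectors.setdefault(label, Counter())
        for t, w in vectorize(toks, idf).items():
            acc[t] += w
    return idf, topic_vectors


def classify(text, idf, topic_vectors):
    # Return the best-matching topic and the similarity score for each topic.
    vec = vectorize(tokenize(text), idf)
    scores = {label: cosine(vec, tv) for label, tv in topic_vectors.items()}
    return max(scores, key=scores.get), scores


training = [
    ("finance", "quarterly revenue forecast and profit margin report"),
    ("finance", "budget revenue audit and profit statement"),
    ("medical", "patient diagnosis treatment and prescription record"),
    ("medical", "patient blood test results and diagnosis"),
]
idf, topic_vectors = train(training)

# A heavily reworded document that still shares topic vocabulary.
modified = "the audit of profit and revenue was rewritten with many extra words"
label, scores = classify(modified, idf, topic_vectors)
print(label, scores)
```

Because the score depends on weighted term overlap rather than exact byte sequences, a reworded document can still land in the right topic, which is the property the abstract contrasts with fingerprinting and regular expressions.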
