CoBAn: A context based model for data leakage prevention

A new context-based model (CoBAn) for accidental and intentional data leakage prevention (DLP) is proposed. Existing methods attempt to prevent data leakage either by looking for specific keywords and phrases or by using various statistical methods. Keyword-based methods are not sufficiently accurate because they ignore the context of the keyword, while statistical methods ignore the content of the analyzed text. The context-based approach we propose combines the advantages of both. The new model consists of two phases: training and detection. During the training phase, clusters of documents are generated and a graph representation of the confidential content of each cluster is created. This representation consists of key terms and the context in which they must appear in order to be considered confidential. During the detection phase, each tested document is assigned to several clusters, and its contents are then matched against each cluster's graph to determine the confidentiality of the document. Extensive experiments have shown that the model is superior to other methods in detecting leakage attempts in which the confidential information is rephrased or otherwise differs from the original examples provided in the training set.
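
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several simplifying assumptions that the abstract does not specify: documents arrive already grouped into clusters (no clustering step is run), a "key term" is any word that occurs in a cluster's confidential documents but not in its non-confidential ones, the "context" of a key term is the set of words within a fixed window around it, and the graph is stored as a plain mapping from key term to context terms. The names `train`, `detect`, `window` and `top_k` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_of(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i."""
    return set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(clusters, window=3):
    """Training phase (assumed form): per cluster, build a term-frequency
    centroid and a key-term -> context-terms mapping (the 'graph')."""
    model = {}
    for name, (confidential_docs, public_docs) in clusters.items():
        public_vocab = {t for d in public_docs for t in tokenize(d)}
        centroid = Counter(t for d in confidential_docs + public_docs for t in tokenize(d))
        graph = defaultdict(set)
        for doc in confidential_docs:
            tokens = tokenize(doc)
            for i, term in enumerate(tokens):
                if term not in public_vocab:  # assumed definition of a key term
                    graph[term] |= context_of(tokens, i, window)
        model[name] = (centroid, dict(graph))
    return model


def detect(doc, model, top_k=2, window=3):
    """Detection phase (assumed form): assign the document to its top_k most
    similar clusters, then score how much of each key term's learned context
    reappears around that key term in the document. Returns a value in [0, 1]."""
    tokens = tokenize(doc)
    doc_vector = Counter(tokens)
    nearest = sorted(model, key=lambda n: cosine(doc_vector, model[n][0]), reverse=True)[:top_k]
    best = 0.0
    for name in nearest:
        graph = model[name][1]
        for i, term in enumerate(tokens):
            learned = graph.get(term)
            if learned:
                overlap = len(context_of(tokens, i, window) & learned) / len(learned)
                best = max(best, overlap)
    return best


# Toy data, invented purely for illustration.
clusters = {
    "finance": (
        ["the merger with acme closes in march at a price of ninety dollars"],
        ["the quarterly report shows revenue growth in march"],
    ),
    "hr": (
        ["salary of jane doe is raised to two hundred thousand"],
        ["the company picnic is held in june"],
    ),
}
model = train(clusters)
rephrased_leak = "acme merger price is ninety dollars"
benign = "revenue growth shows in the quarterly report"
```

In this toy setup, `detect(rephrased_leak, model)` yields a positive score even though the sentence is not a verbatim copy of the training example, while `detect(benign, model)` finds no key terms at all. A real system would need a proper clustering step, stop-word handling, term weighting, and a tuned decision threshold, none of which are shown here.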
