Automated big security text pruning and classification

Many security-related big data problems, including document, traffic, and system log analysis, require analysis of unstructured text. Consider the task of analyzing company documents for secure storage. Some may be too sensitive to put on a public cloud and require private storage with its associated backup overhead, some may be safe on the cloud in encrypted form, and some may be sufficiently non-sensitive to be stored on the cloud in plaintext, avoiding encryption and decryption overhead. Being able to make such categorizations autonomously can significantly strengthen data security, organization, and storage efficiency. In this paper, we analyze several baseline machine learning algorithms for security risk assessment and develop techniques to improve upon them. In particular, we examine document sensitivity labeling, assigning each paragraph in a document one of three levels of security risk. For evaluation, we use real sensitive texts from documents leaked by the WikiLeaks organization. We improve upon the baseline models by using probabilistic topic modeling via Latent Dirichlet Allocation to identify samples from impure subtopics in the training set before training a logistic regression classifier.
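
The pruning idea in the last sentence can be illustrated with a minimal sketch. It assumes each training paragraph has already been assigned a dominant topic by a topic model such as Latent Dirichlet Allocation, and it treats a topic as "impure" when its majority label covers less than a chosen share of its samples. The function names, the `min_purity` threshold, and the toy data are hypothetical and are not taken from the paper, whose exact purity criterion may differ.

```python
from collections import Counter, defaultdict


def topic_purity(labels, topics):
    """Return {topic: share of its samples that carry the topic's majority label}."""
    by_topic = defaultdict(Counter)
    for label, topic in zip(labels, topics):
        by_topic[topic][label] += 1
    return {
        topic: counts.most_common(1)[0][1] / sum(counts.values())
        for topic, counts in by_topic.items()
    }


def prune_impure(samples, labels, topics, min_purity=0.8):
    """Keep only samples whose dominant topic is at least `min_purity` pure.

    `min_purity` is an assumed, illustrative threshold.
    """
    purity = topic_purity(labels, topics)
    kept = [
        (sample, label)
        for sample, label, topic in zip(samples, labels, topics)
        if purity[topic] >= min_purity
    ]
    return [sample for sample, _ in kept], [label for _, label in kept]


# Toy training set: placeholder paragraphs, sensitivity labels, and the
# dominant topic id assumed to come from a previously fitted topic model.
paragraphs = ["p0", "p1", "p2", "p3", "p4", "p5"]
labels = ["high", "high", "low", "low", "high", "low"]
topics = [0, 0, 1, 1, 2, 2]

kept_paragraphs, kept_labels = prune_impure(paragraphs, labels, topics)
```

In this toy data, topics 0 and 1 each carry a single label, while topic 2 is split evenly between "high" and "low", so its two paragraphs are dropped. The remaining paragraphs and labels would then be used to train the classifier.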
