Context-Aware Data Loss Prevention for Cloud Storage Services

With the wide adoption of hybrid cloud, organizations face many risks that must be mitigated before cloud services can be used to their full potential. One major risk that has drawn considerable attention is preserving the security and confidentiality of sensitive information. Detecting sensitive content in near real time and at cloud scale has become a critical first step for organizations seeking to prevent data loss and comply with data privacy laws and regulations. Proactive detection raises security awareness at an early stage and can therefore be used to govern how information is managed, protected, and used in the hybrid cloud. In contrast to traditional dictionary-based or policy-based approaches, we introduce a system that detects sensitive content by leveraging its semantic context, applying machine learning and deep learning techniques at different levels of granularity within a document. We present it as the first system of its kind.
