Fast Detection of Transformed Data Leaks

Leaks of sensitive data from computer systems pose a serious threat to organizational security. Statistics show that missing or improper encryption of files and communications, caused by human error, is one of the leading causes of data loss. Organizations need tools that identify exposed sensitive data by screening content in storage and in transmission, i.e., that detect sensitive information stored or transmitted in the clear. However, detecting such exposure is challenging because the content may have been transformed. Transformations such as insertion and deletion result in highly unpredictable leak patterns. In this paper, we utilize sequence alignment techniques to detect complex data-leak patterns. Our algorithm is designed for detecting long and inexact sensitive data patterns. The detection is paired with a comparable sampling algorithm, which allows one to compare the similarity of two separately sampled sequences. Our system achieves good detection accuracy in recognizing transformed leaks. We implement a parallelized version of our algorithms on a graphics processing unit, which achieves high analysis throughput. We demonstrate that our data leak detection method has the high multithreading scalability required by a sizable organization.
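To illustrate why sequence alignment tolerates transformations such as insertion and deletion, the following is a minimal sketch of textbook Smith-Waterman local alignment. It is not the specialized alignment, the comparable sampling algorithm, or the GPU implementation described in this paper. The function names and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration only.

```python
def local_alignment_score(sensitive, content, match=2, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score.

    The scoring values are illustrative assumptions, not values from the paper.
    """
    # Keep only the previous row of the dynamic-programming table.
    prev = [0] * (len(content) + 1)
    best = 0
    for i in range(1, len(sensitive) + 1):
        curr = [0] * (len(content) + 1)
        for j in range(1, len(content) + 1):
            same = sensitive[i - 1] == content[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def leak_ratio(sensitive, content, match=2, mismatch=-1, gap=-1):
    """Normalize the score by the score of a perfect, untransformed copy."""
    if not sensitive:
        return 0.0
    score = local_alignment_score(sensitive, content, match, mismatch, gap)
    return score / (match * len(sensitive))


if __name__ == "__main__":
    # One inserted character costs a single gap penalty, not the whole match.
    print(leak_ratio("abcdef", "abcXdef"))
    print(leak_ratio("abcdef", "uvwxyz"))
```

In this sketch, a detector would flag content whose ratio exceeds some threshold; the threshold itself is a hypothetical tuning parameter. The quadratic cost of the table is the reason sampling and parallel execution matter at organizational scale.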
