SEAL: Storage-efficient Causality Analysis on Enterprise Logs with Query-friendly Compression

Causality analysis automates attack forensics and facilitates behavioral detection by associating causally related but temporally distant system events. Despite its proven usefulness, the analysis faces an inherent big-data challenge: it must store and process the colossal volume of system events that are continuously collected from hundreds of thousands of end-hosts in a realistic network. In addition, its effectiveness in discovering security breaches relies on the assumption that comprehensive historical events spanning a long period are stored. Hence, it is imperative to address the scalability issue to make causality analysis practical in enterprise-level environments. In this work, we present SEAL, a novel data compression approach for causality analysis. Based on information-theoretic observations on system event data, our approach achieves lossless compression and supports near real-time retrieval of historical events. In the compression step, SEAL examines the causality graph induced by the system logs and exploits its abundant potential for edge reduction. In the query step, decompression is executed opportunistically to maximize speed. Experiments on two real-world datasets show that SEAL offers 2.63x and 12.94x data size reduction, respectively. Moreover, 89% of the queries run faster on the compressed dataset than on the uncompressed one, and SEAL returns exactly the same query results as the uncompressed data.
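
The abstract does not spell out the compression or query algorithms, so the following is only a minimal sketch of the general idea, not SEAL's actual method. It assumes events are simple (source, destination, timestamp) triples, merges repeated edges between the same pair of nodes, stores their timestamps as deltas, and decodes an edge at query time only when its time range can overlap the query window. The names `compress`, `decompress_edge`, and `query` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed event shape for this sketch: (source, destination, timestamp).
Event = Tuple[str, str, int]
Merged = Dict[Tuple[str, str], List[int]]


def compress(events: List[Event]) -> Merged:
    """Merge repeated (source, destination) edges and delta-encode timestamps."""
    grouped = defaultdict(list)
    for src, dst, ts in events:
        grouped[(src, dst)].append(ts)
    merged: Merged = {}
    for key, stamps in grouped.items():
        stamps.sort()
        # First entry is the earliest timestamp; the rest are gaps between events.
        merged[key] = [stamps[0]] + [b - a for a, b in zip(stamps, stamps[1:])]
    return merged


def decompress_edge(deltas: List[int]) -> List[int]:
    """Recover the original sorted timestamps of one merged edge."""
    stamps, total = [], 0
    for delta in deltas:
        total += delta
        stamps.append(total)
    return stamps


def query(merged: Merged, src: str, start: int, end: int) -> List[Event]:
    """Return events emitted by src within [start, end]."""
    results = []
    for (s, dst), deltas in merged.items():
        if s != src:
            continue
        first, last = deltas[0], sum(deltas)
        if last < start or first > end:
            continue  # time range cannot match, so skip decoding this edge
        for ts in decompress_edge(deltas):
            if start <= ts <= end:
                results.append((s, dst, ts))
    return sorted(results)
```

Because the deltas reproduce every timestamp exactly, a query over the merged edges returns the same events as a scan over the raw log, which mirrors the lossless property claimed in the abstract.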
