NodeMerge: Template Based Efficient Data Reduction For Big-Data Causality Analysis

Today's enterprises are exposed to sophisticated attacks, such as Advanced Persistent Threat (APT) attacks, which usually consist of multiple stealthy steps. To counter these attacks, enterprises often rely on causality analysis of the system activity data collected by ubiquitous system monitoring to discover the initial penetration point and, from there, identify previously unknown attack steps. However, one major challenge for causality analysis is that ubiquitous system monitoring generates a colossal amount of data, and hosting such a huge amount of data is prohibitively expensive. Thus, there is a strong demand for techniques that reduce the storage of data for causality analysis and yet preserve the quality of the analysis. To address this problem, we propose NodeMerge, a template-based data reduction system for online system event storage. Specifically, our approach works directly on the stream of system dependency data and achieves data reduction on read-only file events based on their access patterns. It can either reduce the storage cost or improve the performance of causality analysis under the same budget. With only a reasonable amount of resources for online data reduction, it nearly completely preserves the accuracy of causality analysis. The reduced form of the data can be used directly with little overhead. To evaluate our approach, we conducted a set of comprehensive evaluations, which show that, for different categories of workloads, our system can reduce the storage capacity required for raw system dependency data by as much as 75.7 times, and the storage capacity required by the state-of-the-art approach by as much as 32.6 times. Furthermore, the results demonstrate that our approach keeps all the causality analysis information and has a reasonably small overhead in memory and hard disk.
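
To make the idea of template-based reduction on read-only file events more concrete, the following is a minimal illustrative sketch in Python. It is not the NodeMerge algorithm: the abstract does not specify how templates are learned, so the sketch simply assumes that a template is a set of read-only files that several processes read in exactly the same combination. The event format `(process, file, operation)`, the function names, and the `min_support` and `min_size` thresholds are all hypothetical.

```python
from collections import Counter, defaultdict

def find_read_only_files(events):
    """Files that are read at least once and never written (assumed definition)."""
    written = {f for _, f, op in events if op == "write"}
    read = {f for _, f, op in events if op == "read"}
    return read - written

def learn_templates(events, min_support=2, min_size=2):
    """Treat a set of read-only files as a template when at least
    `min_support` processes read exactly that set (a simplification)."""
    read_only = find_read_only_files(events)
    reads_by_proc = defaultdict(set)
    for proc, f, op in events:
        if op == "read" and f in read_only:
            reads_by_proc[proc].add(f)
    counts = Counter(frozenset(s) for s in reads_by_proc.values())
    return [s for s, n in counts.items() if n >= min_support and len(s) >= min_size]

def merge_events(events, templates):
    """Replace the reads covered by a template with one read of a template node."""
    names = {t: f"template_{i}" for i, t in enumerate(templates)}
    reads_by_proc = defaultdict(set)
    for proc, f, op in events:
        if op == "read":
            reads_by_proc[proc].add(f)
    merged, emitted = [], set()
    for proc, f, op in events:
        replaced = False
        if op == "read":
            for t, name in names.items():
                # Merge only if this process read every file in the template.
                if f in t and t <= reads_by_proc[proc]:
                    if (proc, name) not in emitted:
                        merged.append((proc, name, "read"))
                        emitted.add((proc, name))
                    replaced = True
                    break
        if not replaced:
            merged.append((proc, f, op))
    return merged

# Hypothetical event stream: (process, file, operation).
events = [
    ("p1", "libc.so", "read"), ("p1", "libm.so", "read"), ("p1", "out.log", "write"),
    ("p2", "libc.so", "read"), ("p2", "libm.so", "read"),
    ("p2", "data.db", "read"), ("p2", "data.db", "write"),
    ("p3", "libc.so", "read"),
]
templates = learn_templates(events)
merged = merge_events(events, templates)
print(len(events), "events before,", len(merged), "events after")
```

In this toy stream, the two library files are read together by `p1` and `p2`, so their reads collapse into a single read of one template node per process, while `p3`, which reads only one of the files, is left untouched. Keeping the mapping from each template node back to its member files is what would allow a later causality query to expand the node again; a real system would also need to handle streaming input and partial matches, which this sketch ignores.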
