Feature Extraction and Feature Selection: Reducing Data Complexity With Apache Spark

Feature extraction and feature selection are the first pre-processing tasks applied to input logs when machine learning is used to detect cyber security threats and attacks. When the data are heterogeneous and come from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach to feature extraction and feature selection for security analytics of heterogeneous data from different network sensors. The approach is implemented in Apache Spark using its Python API, PySpark.
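The abstract does not describe the concrete pipeline, so the following is only a minimal sketch of the two steps under stated assumptions. It uses plain Python rather than PySpark so that it runs without a cluster. The sensor names, record fields, and the variance-based selection rule are all hypothetical illustrations, not the method of the paper. In a Spark setting, a per-record function like `extract_features` would typically be applied in parallel across a distributed dataset.

```python
from statistics import pvariance

# Hypothetical records from two sensors with different schemas (assumed data).
records = [
    {"sensor": "firewall", "src_ip": "10.0.0.5", "bytes": 1200, "action": "allow"},
    {"sensor": "firewall", "src_ip": "10.0.0.9", "bytes": 90000, "action": "deny"},
    {"sensor": "ids", "src_ip": "10.0.0.5", "alert_level": 3},
    {"sensor": "ids", "src_ip": "10.0.0.7", "alert_level": 1},
]


def extract_features(record):
    """Feature extraction: map one heterogeneous record onto a shared numeric schema."""
    return {
        "bytes": float(record.get("bytes", 0)),
        "alert_level": float(record.get("alert_level", 0)),
        "denied": 1.0 if record.get("action") == "deny" else 0.0,
        "is_internal": 1.0 if record["src_ip"].startswith("10.") else 0.0,
    }


def select_features(rows, min_variance=0.0):
    """Feature selection: keep features whose population variance exceeds min_variance.

    A variance threshold is an assumed, deliberately simple criterion.
    """
    names = sorted(rows[0])
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


rows = [extract_features(r) for r in records]
selected = select_features(rows)
print(selected)
```

Here `is_internal` is constant across all records, so it carries no information and is dropped, while the remaining features are kept. A real pipeline would likely use a selection criterion tied to the detection task rather than variance alone.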
