Extending Isolation Forest for Anomaly Detection in Big Data via K-Means

Industrial information technology infrastructures are often vulnerable to cyberattacks. Securing the computer systems in an industrial environment requires effective intrusion detection systems that monitor cyber-physical systems (e.g., computer networks) for malicious activity. This article aims to build such an intrusion detection system to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Because we target big data scenarios in the industrial domain, we implement the proposed model with the Apache Spark framework and train it on large-scale network traffic data (about 123 million instances) stored in Elasticsearch. We also evaluate the model on live streaming data and find that the system can be used for real-time anomaly detection in an industrial setup. In addition, we describe the challenges we faced while training the model on large datasets and explain how each was resolved. Our empirical evaluation across different anomaly detection use cases on real-world network traffic data shows that the proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate the model on several academic datasets and find that its performance is comparable to that of other state-of-the-art approaches.
