Time Split Based Pre-processing with a Data-Driven Approach for Malicious URL Detection

Malicious uniform resource locators (URLs) host unsolicited content, pose a serious threat, and are used to commit cybercrime. They are responsible for cyber attacks such as spamming, identity theft, and financial fraud. The growth of the internet has also increased fraudulent activity on the web. Classical methods such as blacklisting are ineffective at detecting newly generated malicious URLs, so there is a need for an effective algorithm to detect and classify them. At the same time, recent advances in machine learning have shown promising results in areas such as image processing, natural language processing (NLP), and other domains. This motivates us to pursue machine learning based techniques for detecting and classifying URLs. However, significant challenges in malicious URL detection remain to be addressed. First and foremost, any available data used for detecting malicious URLs is outdated, which makes a model difficult to deploy in a real-time scenario. Second, the inability to capture semantic and sequential information hurts generalization to the test data. To overcome these shortcomings, we introduce the concepts of time split and random split on the training data. Random split divides the data randomly into training and testing sets, whereas time split divides the data based on the time information of the URLs. The split is followed by different representations of the data. These representations are passed to classical machine learning and deep learning techniques to evaluate performance. Analysis on the data set from the Sophos Machine Learning building blocks tutorial shows better performance for time split based grouping of the data, with a decision tree classifier reaching an accuracy of 88.5%. Additionally, a highly scalable framework is designed to passively collect data from various data sources inside an Ethernet LAN. The proposed framework can collect data in real time and process it in a distributed way to provide situational awareness. It can be easily extended to handle very large volumes of cyber events by adding resources to the existing system.
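
The difference between the two splitting strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (URL, integer timestamp, label), the function names, and the 80/20 ratio are assumptions made here purely for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (url, first_seen_timestamp, label),
# where label 1 marks a malicious URL and 0 a benign one.
Record = Tuple[str, int, int]


def random_split(records: List[Record], test_fraction: float = 0.2,
                 seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle the records, then cut them into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def time_split(records: List[Record],
               test_fraction: float = 0.2) -> Tuple[List[Record], List[Record]]:
    """Order the records by timestamp so the test set holds the newest URLs."""
    ordered = sorted(records, key=lambda r: r[1])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Ten made-up records with increasing timestamps.
records = [(f"http://example.test/page{i}", 1000 + i, i % 2) for i in range(10)]

train_r, test_r = random_split(records)
train_t, test_t = time_split(records)
```

With the time split, every training URL is at least as old as every test URL, so evaluation resembles deployment on URLs seen after training. The random split gives no such guarantee, because older and newer URLs are mixed across both sets.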
