Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection

Conventional malicious webpage detection methods rely on blacklists to decide whether a webpage is malicious. These blacklists are generally maintained by third-party organizations. However, keeping a list of all malicious websites and updating it regularly is not an easy task, given the frequently changing and rapidly growing number of webpages on the web. In this study, we propose a novel context-sensitive and keyword density-based method for classifying webpages using three supervised machine learning techniques: support vector machine, maximum entropy, and extreme learning machine. Features (words) of webpages are obtained from their HTML contents, and information is extracted using three feature extraction methods: existence of words, keyword frequencies, and keyword density. The performance of the proposed machine learning models is evaluated on a benchmark data set consisting of one hundred thousand webpages. Experimental results show that the proposed method can detect malicious webpages with an accuracy of 98.24%, which is a significant improvement compared to state-of-the-art approaches.
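The three feature extraction methods named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regex-based tag stripping, the example keyword list, and the definition of keyword density as keyword count divided by total word count are all assumptions made for illustration, and the context-sensitive part of the method is not modeled.

```python
import re
from collections import Counter


def tokenize(html):
    """Strip tags crudely and return lowercase word tokens (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())


def extract_features(html, keywords):
    """Return three hypothetical feature views of one page for the given keywords."""
    tokens = tokenize(html)
    counts = Counter(tokens)
    total = len(tokens)
    # Existence of words: 1 if the keyword occurs at least once, else 0.
    presence = {k: int(counts[k] > 0) for k in keywords}
    # Keyword frequencies: raw occurrence counts.
    frequency = {k: counts[k] for k in keywords}
    # Keyword density (assumed definition): count divided by total words on the page.
    density = {k: (counts[k] / total if total else 0.0) for k in keywords}
    return presence, frequency, density


# Made-up example page and keyword list.
page = "<html><body><p>Free download! Download the free crack now.</p></body></html>"
presence, frequency, density = extract_features(page, ["free", "download", "login"])
print(presence)
print(frequency)
print(density)
```

Each dictionary could be turned into a fixed-order vector and passed to a classifier such as a support vector machine. Under the assumed definition, density normalizes for page length, so a short page and a long page that use a keyword in the same proportion receive the same value.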
