Detecting scareware by mining variable length instruction sequences

Scareware is a recent type of malicious software that may pose financial and privacy-related threats to novice users. Traditional countermeasures, such as anti-virus software, require regular updates and often cannot detect novel (unseen) instances. This paper presents a scareware detection method that applies machine learning algorithms to learn patterns in variable-length opcode sequences extracted from the instruction sequences of binary files. The patterns are then used to classify software as legitimate or scareware, and they may also reveal interpretable behavior that is unique to either type of software. We have obtained a large number of real-world scareware applications and constructed a data set with 550 scareware instances and 250 benign instances. The experimental results show that several common data mining algorithms are able to generate accurate models from the data set, and that the Random Forest algorithm outperforms the other algorithms in the experiment. Essentially, our study shows that, even though the differences between scareware and legitimate software are subtler than those between, say, viruses and legitimate software, the same type of machine learning approach can be used in both of these dissimilar cases.
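
The feature extraction step described above can be illustrated with a minimal sketch. It assumes that each binary has already been disassembled into a list of opcode mnemonics, and it treats every contiguous opcode subsequence within a length range as a candidate feature. The function names, the length bounds, the document-frequency threshold, and the toy opcode lists are all hypothetical choices made for illustration; the abstract does not specify how the sequences are actually extracted or selected.

```python
from collections import Counter


def opcode_ngrams(opcodes, min_len=2, max_len=4):
    """Yield every contiguous opcode subsequence whose length is in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for i in range(len(opcodes) - n + 1):
            yield tuple(opcodes[i:i + n])


def build_vocabulary(programs, min_len=2, max_len=4, min_doc_freq=2):
    """Keep sequences that appear in at least min_doc_freq programs (assumed filter)."""
    doc_freq = Counter()
    for opcodes in programs:
        # A set ensures each program contributes at most once per sequence.
        doc_freq.update(set(opcode_ngrams(opcodes, min_len, max_len)))
    return sorted(seq for seq, df in doc_freq.items() if df >= min_doc_freq)


def to_feature_vector(opcodes, vocabulary, min_len=2, max_len=4):
    """Map one program to a vector of sequence counts, one entry per vocabulary item."""
    counts = Counter(opcode_ngrams(opcodes, min_len, max_len))
    return [counts[seq] for seq in vocabulary]


# Toy opcode listings standing in for disassembled binaries (hypothetical data).
programs = [
    ["push", "mov", "call", "pop", "ret"],
    ["push", "mov", "call", "jmp"],
    ["xor", "mov", "call", "ret"],
]

vocabulary = build_vocabulary(programs)
vectors = [to_feature_vector(p, vocabulary) for p in programs]
```

Each resulting vector, paired with a scareware or benign label, could then be passed to any standard classifier, such as a Random Forest implementation; that step is omitted here to keep the sketch free of external dependencies.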
