N-grams-based File Signatures for Malware Detection

Malware is any malicious code that has the potential to harm any computer or network. The amount of malware is increasing faster every year and poses a serious security threat. Thus, malware detection is a critical topic in computer security. Currently, signature-based detection is the most extended method for detecting malware. Although this method is still used on most popular commercial computer antivirus software, it can only achieve detection once the virus has already caused damage and it is registered. Therefore, it fails to detect new malware. Applying a methodology proven successful in similar problem-domains, we propose the use of ngrams (every substring of a larger string, of a fixed lenght n) as file signatures in order to detect unknown malware whilst keeping low false positive ratio. We show that n-grams signatures provide an effective way to detect unknown malware.

[1]  Shuigeng Zhou,et al.  Chinese Documents Classification Based on N-Grams , 2002, CICLing.

[2]  Jeffrey O. Kephart,et al.  A biologically inspired immune system for computers , 1994 .

[3]  Maya Gokhale,et al.  Language classification using n-grams accelerated by FPGA-based Bloom filters , 2007, HPRCTA.

[4]  J. L. Hodges,et al.  Discriminatory Analysis - Nonparametric Discrimination: Small Sample Performance , 1952 .

[5]  Vlado Keselj,et al.  N-gram-based detection of new malicious code , 2004, Proceedings of the 28th Annual International Computer Software and Applications Conference, 2004. COMPSAC 2004..

[6]  Carey Nachenberg,et al.  Computer virus-antivirus coevolution , 1997, Commun. ACM.

[7]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.