Aiding prediction algorithms in detecting high-dimensional malicious applications using a randomized projection technique

This research paper describes an ongoing effort to design, develop, and improve malicious application detection algorithms. The work focuses on improving a cosine-similarity information retrieval technique, so that it better detects known malicious applications and variants of them, by applying the feature extraction technique known as randomized projection. Document similarity techniques such as cosine similarity have been used with great success in several document retrieval applications. Following a standard information retrieval methodology, software in machine-readable format can be regarded as the documents of a corpus. These "documents" may or may not have known malicious functionality. The query is also software in machine-readable format, and it contains a certain type of malicious software. This methodology makes it possible to search the corpus with a query and to retrieve or identify potentially malicious software, as well as other instances of the same type of vulnerability. Retrieval is based on the similarity of the query to a given document in the corpus. This type of information retrieval technique can suffer from what is known as "the curse of dimensionality," and several efforts have been made to overcome it, including mutual information and randomized projections. Randomized projections are used here to create a low-dimensional embedding of the high-dimensional data. Experimental results have shown promise over previously published efforts.
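The retrieval idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state which features are extracted from the software, so byte n-gram counts are assumed here, and a Gaussian random matrix is used as one common way to build a randomized projection. All function names and the toy byte strings are hypothetical.

```python
import math
import random
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams (an assumed feature choice)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def random_projection_matrix(d: int, k: int, seed: int = 0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (one common construction)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vec):
    """Map a d-dimensional vector to k dimensions: y = R x."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


def rank_by_similarity(corpus: dict, query: bytes, k: int = 8, seed: int = 0):
    """Rank corpus documents by cosine similarity to the query,
    computed in the randomly projected k-dimensional space."""
    doc_counts = {name: byte_ngram_counts(data) for name, data in corpus.items()}
    query_counts = byte_ngram_counts(query)

    # For simplicity the vocabulary covers the corpus and the query together.
    vocab = sorted(set(query_counts).union(*doc_counts.values()))
    matrix = random_projection_matrix(len(vocab), k, seed)

    def embed(counts):
        return project(matrix, [counts.get(term, 0) for term in vocab])

    q = embed(query_counts)
    scores = [(name, cosine(q, embed(c))) for name, c in doc_counts.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy byte strings standing in for machine-readable software.
    corpus = {
        "sample_a": b"\x90\x90\xeb\xfe\x90\x90\xcc\x90\x90\xeb\xfe",
        "sample_b": b"plain text that shares few byte pairs",
        "sample_c": b"\x00\x01\x02\x03\x04\x05\x06\x07",
    }
    query = b"\x90\x90\xeb\xfe\x90\x90\xeb\xfe\xcc"
    for name, score in rank_by_similarity(corpus, query, k=8, seed=1):
        print(f"{name}: {score:.3f}")
```

With a projected dimension as small as the k = 8 used here, similarities are only roughly preserved. The Johnson-Lindenstrauss lemma suggests that larger values of k keep pairwise distances closer to those in the original space, so k is a trade-off between fidelity and cost.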
