Rebuilding the Tower of Babel: Towards Cross-System Malware Information Sharing

Anti-virus systems developed by different vendors often show strong discrepancies in how they name malware, which significantly hinders malware information sharing. While existing work has proposed a plethora of malware naming standards, most anti-virus vendors have been reluctant to change their own naming conventions. In this paper, we explore a new, more pragmatic alternative. We propose to exploit the correlation between the malware naming of different anti-virus systems to create a consensus classification, through which these systems can share malware information without modifying their naming conventions. Specifically, we present Latin, a novel classification integration framework that leverages the correspondence between participating anti-virus systems as reflected in heterogeneous information sources at the instance-instance, instance-name, and name-name levels. We report results from extensive experimental studies using real malware datasets and concrete use cases to verify the efficacy of Latin in supporting cross-system malware information sharing.
