Marmite: Spreading Malicious File Reputation Through Download Graphs

Effective malware detection approaches need not only high accuracy, but also need to be robust to changes in the modus operandi of criminals. In this paper, we propose Marmite, a feature-agnostic system that aims at propagating known malicious reputation of certain files to unknown ones with the goal of detecting malware. Marmite does this by looking at a graph that encapsulates a comprehensive view of how files are downloaded (by which hosts and from which servers) on a global scale. The reputation of files is then propagated across the graph using semi-supervised label propagation with Bayesian confidence. We show that Marmite is able to reach high accuracy (0.94 G-mean on average) over a 10-day dataset of 200 million download events. We also demonstrate that Marmite's detection capabilities do not significantly degrade over time, by testing our system on a 30-day dataset of 660 million download events collected six months after the system was tuned and validated. Marmite still maintains a similar accuracy after this period of time.

[1]  Cristiano Giuffrida,et al.  Detection of Intrusions and Malware, and Vulnerability Assessment , 2018, Lecture Notes in Computer Science.

[2]  Christopher Krügel,et al.  Nazca: Detecting Malware Distribution in Large-Scale Networks , 2014, NDSS.

[3]  Leyla Bilge,et al.  Cutting the Gordian Knot: A Look Under the Hood of Ransomware Attacks , 2015, DIMVA.

[4]  Roberto Perdisci,et al.  WebWitness: Investigating, Categorizing, and Mitigating Malware Download Paths , 2015, USENIX Security Symposium.

[5]  Sandeep Yadav,et al.  Detecting Malicious Domains via Graph Inference , 2014, ESORICS.

[6]  Christopher Krügel,et al.  Your botnet is my botnet: analysis of a botnet takeover , 2009, CCS.

[7]  Wenke Lee,et al.  ARROW: GenerAting SignatuRes to Detect DRive-By DOWnloads , 2011, WWW.

[8]  Niels Provos,et al.  CAMP: Content-Agnostic Malware Protection , 2013, NDSS.

[9]  Paolo Milani Comparetti,et al.  EvilSeed: A Guided Approach to Finding Malicious Web Pages , 2012, 2012 IEEE Symposium on Security and Privacy.

[10]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[11]  Christos Faloutsos,et al.  Polonium: Tera-Scale Graph Mining and Inference for Malware Detection , 2011 .

[12]  Duen Horng Chau,et al.  Guilt by association: large scale malware detection by mining file-relation graphs , 2014, KDD.

[13]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[14]  Chris Sharp,et al.  Investigating Commercial Pay-Per-Install and the Distribution of Unwanted Software , 2016, USENIX Security Symposium.

[15]  Son Lam Phung,et al.  Learning Pattern Classification Tasks with Imbalanced Data Sets , 2009 .

[16]  Zhenkai Liang,et al.  BitBlaze: A New Approach to Computer Security via Binary Analysis , 2008, ICISS.

[17]  Felix C. Freiling,et al.  Toward Automated Dynamic Malware Analysis Using CWSandbox , 2007, IEEE Secur. Priv..

[18]  Stefan Savage,et al.  Manufacturing compromise: the emergence of exploit-as-a-service , 2012, CCS.

[19]  Leyla Bilge,et al.  Measuring PUP Prevalence and PUP Distribution through Pay-Per-Install Services , 2016, USENIX Security Symposium.

[20]  Christos Faloutsos,et al.  SocNL: Bayesian Label Propagation with Confidence , 2015, PAKDD.

[21]  Vern Paxson,et al.  Measuring Pay-per-Install: The Commoditization of Malware Distribution , 2011, USENIX Security Symposium.

[22]  Gianluca Stringhini,et al.  Shady paths: leveraging surfing crowds to detect malicious web pages , 2013, CCS.

[23]  Christopher Krügel,et al.  Revolver: An Automated Approach to the Detection of Evasive Web-based Malware , 2013, USENIX Security Symposium.

[24]  Gianluca Stringhini,et al.  The Underground Economy of Spam: A Botmaster's Perspective of Coordinating Large-Scale Spam Campaigns , 2011, LEET.

[25]  Antonio Nucci,et al.  Detecting malicious HTTP redirections using trees of user browsing activity , 2014, IEEE INFOCOM 2014 - IEEE Conference on Computer Communications.

[26]  Herbert Bos,et al.  Large-Scale Analysis of Malware Downloaders , 2012, DIMVA.

[27]  Leyla Bilge,et al.  The Dropper Effect: Insights into Malware Distribution with Downloader Graph Analytics , 2015, CCS.

[28]  Chris Kanich,et al.  Taster's choice: a comparative analysis of spam feeds , 2012, Internet Measurement Conference.

[29]  Fang Yu,et al.  Finding the Linchpins of the Dark Web: a Study on Topologically Dedicated Hosts on Malicious Web Infrastructures , 2013, 2013 IEEE Symposium on Security and Privacy.

[30]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[31]  Christopher Krügel,et al.  BareCloud: Bare-metal Analysis-based Evasive Malware Detection , 2014, USENIX Security Symposium.

[32]  Roberto Perdisci,et al.  From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware , 2012, USENIX Security Symposium.

[33]  Nick Feamster,et al.  Can DNS-Based Blacklists Keep Up with Bots? , 2006, CEAS.

[34]  Babak Rahbarinia,et al.  Real-Time Detection of Malware Downloads via Large-Scale URL->File->Machine Graph Mining , 2016, AsiaCCS.

[35]  Christopher Krügel,et al.  A survey on automated dynamic malware-analysis techniques and tools , 2012, CSUR.

[36]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[37]  Felix C. Freiling,et al.  Measuring and Detecting Fast-Flux Service Networks , 2008, NDSS.

[38]  Nikos Karampatziakis,et al.  Using File Relationships in Malware Classification , 2012, DIMVA.

[39]  Kang Li,et al.  Measuring and Detecting Malware Downloads in Live Network Traffic , 2013, ESORICS.

[40]  Jack W. Stokes,et al.  WebCop: Locating Neighborhoods of Malware on the Web , 2010, LEET.

[41]  Christopher Krügel,et al.  Static Disassembly of Obfuscated Binaries , 2004, USENIX Security Symposium.