Learning communication patterns for malware discovery in HTTPs data

Abstract Encrypted communication on the Internet using the HTTPS protocol poses a challenge for network intrusion detection systems. While encryption significantly helps to preserve users’ privacy, it also limits a detection system’s ability to understand the traffic and effectively identify malicious activities. In this work, we propose a method for modeling and representing encrypted communication from logs of web communication. The idea is to introduce communication snapshots of individual users’ activity that model the contextual information of the encrypted requests. This helps to compensate for the information hidden by the encryption. We then propose statistical descriptors of the communication snapshots that can be consumed by various machine learning algorithms for either supervised or unsupervised analysis of the data. In the experimental evaluation, we show that the presented approach can be used even on a large corpus of network traffic logs, as the creation of the descriptors can be implemented effectively on a Hadoop cluster.
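To make the idea of snapshots and their statistical descriptors more concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a hypothetical log record of the form (user, timestamp in seconds, bytes sent, bytes received), approximates a snapshot by a fixed time window per user, and uses a request count plus means and standard deviations as the descriptor. The function names `build_snapshots` and `describe`, the record layout, and the choice of statistics are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_snapshots(records, window=60):
    """Group each user's requests into fixed time windows.

    Assumption: a snapshot is approximated as all requests of one user
    that fall into the same `window`-second interval.
    """
    snapshots = defaultdict(list)
    for user, ts, bytes_up, bytes_down in records:
        snapshots[(user, int(ts // window))].append((bytes_up, bytes_down))
    return snapshots


def describe(snapshot):
    """Summarise one snapshot as a fixed-length vector of simple statistics.

    Assumption: request count, mean, and population standard deviation of
    the transferred bytes stand in for the descriptors.
    """
    ups = [up for up, _ in snapshot]
    downs = [down for _, down in snapshot]
    return [len(snapshot), mean(ups), pstdev(ups), mean(downs), pstdev(downs)]


# Hypothetical log records: (user, timestamp_s, bytes_up, bytes_down)
records = [
    ("alice", 0, 100, 200),
    ("alice", 30, 300, 400),
    ("alice", 70, 50, 60),
    ("bob", 10, 10, 20),
]
snapshots = build_snapshots(records, window=60)
descriptors = {key: describe(snap) for key, snap in snapshots.items()}
```

Because every snapshot maps to a vector of the same length, the resulting descriptors could be passed to a supervised classifier or a clustering algorithm. The grouping step is a per-key aggregation, which is the kind of operation that fits a map and reduce formulation on a cluster.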
