A SERP-Mining Approach for Classification of DNS Requests

DNS request classification has received considerable attention, mostly as part of the network security process, where requests are classified as malicious or non-malicious. However, several categories of web pages, although not malicious, belong to “borderline” categories and need to be monitored. For instance, a public or private organization may want to monitor outgoing traffic to websites selling illegal substances or weapons. In this work, we treat this as a topic classification problem. We present and evaluate a machine learning framework that takes as input a domain name (taken from the respective DNS request) and outputs the content category it belongs to. We evaluate several options for feature engineering and classification to find the most appropriate setup for this problem domain. We also address the problem of data collection and preprocessing. While several labelled datasets of malicious/non-malicious requests exist, no comparable labelled dataset exists for general web content categories. We therefore propose a SERP (Search Engine Response Pages)-mining approach to collect and label an appropriate dataset. Our experimental evaluation yields several insights and forms a basis for further work in this domain.
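To make the input/output contract concrete, the following is a minimal sketch of classifying a domain by the text of its search results. It is not the framework evaluated in this work: the multinomial Naive Bayes classifier, the training snippets, the category names, the example domains, and the `serp_lookup` dictionary (a stand-in for an actual search engine query) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen during training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled SERP snippets (made up for this sketch).
TRAIN = [
    ("buy rifles pistols and ammunition online", "weapons"),
    ("tactical firearms ammunition and holsters shop", "weapons"),
    ("daily news politics world headlines", "news"),
    ("breaking news headlines and weather reports", "news"),
]

model = NaiveBayesTopicClassifier().fit(
    [text for text, _ in TRAIN], [label for _, label in TRAIN]
)

# Hypothetical mapping from a requested domain to its SERP snippet text.
serp_lookup = {
    "guns.example.com": "discount ammunition and rifles",
    "daily.example.org": "world politics headlines",
}


def classify_domain(domain):
    """Return the predicted content category for a domain name."""
    return model.predict(serp_lookup[domain])
```

Calling `classify_domain("guns.example.com")` returns the category whose training snippets best match the search-result text for that domain. In a real setting the lookup would query a search engine, and the features and classifier would be chosen through the kind of evaluation described above.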
