Vector representation of internet domain names using a word embedding technique

Word embeddings are a well-known set of techniques widely used in natural language processing (NLP), and word2vec is a computationally efficient predictive model for learning such embeddings. This paper explores the use of word embeddings in a new scenario. We create a vector representation of Internet domain names by taking the core ideas from NLP techniques and applying them to real, anonymized Domain Name System (DNS) query logs from a large Internet Service Provider (ISP). Our main objective is to find semantically similar domains using only information from DNS queries, without any prior knowledge about the content of those domains. We use the word2vec unsupervised learning algorithm with a Skip-Gram model to create the embeddings. We validate the quality of our results by expert visual inspection of similarities and by comparing them with a third-party source, namely the similar-sites service offered by Alexa Internet, Inc.
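To make the analogy with NLP concrete, the sketch below shows how Skip-Gram training pairs could be derived from DNS queries. It rests on assumptions not stated in the abstract: each client's time-ordered queries are treated as one "sentence", each queried domain as one "word", and the context window size is arbitrary. The function name `skipgram_pairs` and the sample domains are hypothetical and purely illustrative.

```python
from typing import Iterable, Iterator, List, Tuple


def skipgram_pairs(
    sessions: Iterable[List[str]], window: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) domain pairs for Skip-Gram training.

    Each session is an ordered list of domains queried by one client
    (an assumed grouping; the abstract does not specify it).
    """
    for session in sessions:
        for i, target in enumerate(session):
            # Context = domains at most `window` positions away from the target.
            lo = max(0, i - window)
            hi = min(len(session), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, session[j]


# Made-up query sessions, for illustration only.
sessions = [
    ["news.example", "sports.example", "weather.example"],
    ["mail.example", "news.example"],
]

pairs = list(skipgram_pairs(sessions, window=1))
for target, context in pairs:
    print(target, "->", context)
```

A Skip-Gram model is then trained to predict each context domain from its target domain, so domains that tend to be queried near the same neighbors end up with nearby vectors.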
