Probabilistic top-k query: Model and application on web traffic analysis

Ranking the top-k websites by traffic volume helps Internet Service Providers (ISPs) understand network status and optimize network resources. However, the resulting ranking typically deviates substantially from the actual ranking because of unknown web traffic, which current techniques cannot identify accurately. In this paper, we introduce a novel method to approximate the actual ranking. The method associates unknown web traffic with websites according to statistical probabilities and then constructs a probabilistic top-k query model to rank the websites. We conduct several experiments using real HTTP traffic traces collected from a commercial ISP covering an entire city in northern China. Experimental results show that the proposed techniques greatly reduce the deviation between the ground truth and the ranking results. In addition, we find that websites providing video services have a higher ratio of unknown IPs, as well as a higher ratio of unknown traffic, than websites providing text web pages. Specifically, we find that the top-3 video websites have more than 90% of unknown web traffic. These findings help ISPs understand network status and deploy Content Delivery Networks (CDNs).
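
The abstract does not specify the probabilistic model, so the following is only a minimal sketch of the general idea, assuming the simplest possible variant: each unknown flow is split across candidate websites in proportion to a per-IP probability, and websites are ranked by expected traffic volume. The function names, the site and IP values, and the probabilities are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def expected_volumes(known_bytes, unknown_flows, site_given_ip):
    """Return the expected traffic volume per website.

    known_bytes:   {site: bytes already attributed to that site}
    unknown_flows: [(server_ip, bytes)] flows with no identified site
    site_given_ip: {server_ip: {site: probability}} (assumed to be
                   estimated elsewhere, e.g. from identified traffic)
    """
    totals = defaultdict(float)
    for site, volume in known_bytes.items():
        totals[site] += volume
    for ip, volume in unknown_flows:
        # Split each unknown flow across candidate sites by probability.
        for site, prob in site_given_ip.get(ip, {}).items():
            totals[site] += prob * volume
    return dict(totals)


def top_k(volumes, k):
    """Return the k sites with the largest volume (ties broken by name)."""
    return sorted(volumes, key=lambda s: (-volumes[s], s))[:k]


# Hypothetical data: the video site serves most of its traffic from an IP
# that cannot be identified directly.
known_bytes = {"video.example": 100, "news.example": 400, "blog.example": 300}
unknown_flows = [("10.0.0.1", 900), ("10.0.0.2", 100)]
site_given_ip = {
    "10.0.0.1": {"video.example": 0.9, "news.example": 0.1},
    "10.0.0.2": {"blog.example": 1.0},
}

naive_rank = top_k(known_bytes, 2)
prob_rank = top_k(expected_volumes(known_bytes, unknown_flows, site_given_ip), 2)
print(naive_rank)
print(prob_rank)
```

In this toy input, ranking on identified traffic alone leaves the video site out of the top 2, while the expected-volume ranking moves it to first place. A full probabilistic top-k query over uncertain data would typically reason about the distribution of possible rankings rather than only expected values; the sketch deliberately omits that.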
