Classification of content and users in BitTorrent by semi-supervised learning methods

P2P downloads still represent a large portion of today's Internet traffic. More than 100 million users operate BitTorrent and generate more than 30% of the total Internet traffic. Recently, significant research effort has gone into developing tools for the automatic classification of Internet traffic by application. The purpose of the present work is to provide a framework for the subclassification of P2P traffic generated by the BitTorrent protocol. The general intuition is that users with similar interests download similar content. This intuition can be rigorously formalized using a graph-based semi-supervised learning approach. We have chosen to work with a PageRank-based semi-supervised learning method, which scales well with very large volumes of data. We provide recommendations for the choice of parameters in the PageRank-based semi-supervised learning method. In particular, we show that it is advantageous to choose labelled points with a large PageRank score.
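As an illustration of the idea, the following is a minimal sketch of a PageRank-based semi-supervised classifier on a toy user-content graph. It is not the implementation used in this work. It assumes an undirected graph with no isolated nodes and runs one personalized PageRank per class, restarting on that class's labelled points; each node is then assigned to the class with the highest score. The graph, the node names, the class names, the functions `build_graph`, `personalized_pagerank`, and `classify`, and the values `alpha=0.85` and `iters=100` are all hypothetical choices made for the example.

```python
def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for PageRank that restarts uniformly on `seeds`."""
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    score = dict(restart)
    for _ in range(iters):
        # With probability 1 - alpha the walk jumps back to a seed.
        nxt = {v: (1.0 - alpha) * restart[v] for v in adj}
        # With probability alpha it moves to a uniformly chosen neighbour.
        for u in adj:
            share = alpha * score[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        score = nxt
    return score


def classify(adj, labelled, alpha=0.85, iters=100):
    """Assign every node to the class whose personalized PageRank is largest."""
    scores = {
        cls: personalized_pagerank(adj, seeds, alpha, iters)
        for cls, seeds in labelled.items()
    }
    return {v: max(scores, key=lambda cls: scores[cls][v]) for v in adj}


# Toy data (assumed): users u1..u4 linked to the contents c1..c4 they download.
edges = [
    ("u1", "c1"), ("u1", "c2"), ("u2", "c1"), ("u2", "c2"),
    ("u2", "c3"),  # one user whose downloads bridge the two groups
    ("u3", "c3"), ("u3", "c4"), ("u4", "c3"), ("u4", "c4"),
]
adj = build_graph(edges)

# One labelled point per class; every other node is unlabelled.
labelled = {"video": {"c1"}, "music": {"c4"}}
labels = classify(adj, labelled)
```

In this sketch, `alpha` and the choice of labelled points are the parameters to tune. Labels spread from `c1` and `c4` to the users and contents that share downloads with them, which mirrors the intuition that users with similar interests download similar content.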
