Internet Traffic Analysis Using Community Detection and Apache Spark

With the rapid development of the Internet, both Internet traffic and the number of end hosts continue to grow, and traffic behavior analysis for a large-scale network is becoming increasingly difficult. To address this challenge, this paper proposes an Internet traffic analysis approach based on community detection that discovers communities of end hosts with similar traffic behavior in a large campus network. First, we use only IP-to-IP information, without packet payloads, to model the similarity of end hosts in the campus network. Then we construct a similarity graph that represents the social behavior similarity of all end hosts. Finally, we apply the Label Propagation algorithm to discover end-host communities on the similarity graph. To meet the demand for scalable analysis of ever-growing Internet traffic data, we develop a Spark-based Internet traffic analysis system that implements the above algorithm. Experimental results on real campus network traffic show the benefits of the proposed approach for analyzing the traffic behavior of a large-scale network at the host-community level and for detecting potentially anomalous traffic behavior. Compared with analyzing individual hosts, the proposed approach reduces the complexity of analyzing the traffic behavior of a large network. The experimental results also demonstrate that the Spark-based Internet traffic analysis system can analyze Internet traffic efficiently.
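The three steps described above (host similarity from IP-to-IP records, a similarity graph, and label propagation) can be illustrated with a small single-machine sketch in plain Python. It is not the paper's Spark implementation. The abstract does not state which similarity measure is used, so the Jaccard similarity of peer sets, the `threshold` value, the deterministic tie-breaking rule, the function names, and the sample addresses are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_similarity_graph(flows, threshold=0.5):
    """Link two source hosts when their peer sets are similar enough.

    Assumption: similarity is the Jaccard index of the sets of
    destinations each host contacted; the paper may define it differently.
    """
    peers = defaultdict(set)
    for src, dst in flows:
        peers[src].add(dst)
    graph = {host: {} for host in peers}
    for a, b in combinations(sorted(peers), 2):
        sim = len(peers[a] & peers[b]) / len(peers[a] | peers[b])
        if sim >= threshold:
            graph[a][b] = sim
            graph[b][a] = sim
    return graph


def label_propagation(graph, max_iter=20):
    """Asynchronous, similarity-weighted label propagation.

    Assumption: nodes are visited in sorted order and ties go to the
    smallest label, so the result is reproducible. The usual algorithm
    visits nodes in random order and breaks ties randomly.
    """
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated host keeps its own label
            weight = defaultdict(float)
            for neighbor, sim in graph[node].items():
                weight[labels[neighbor]] += sim
            best = min(weight, key=lambda lab: (-weight[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical (source IP, destination IP) records; no payloads are needed.
flows = [
    ("10.0.0.1", "d1"), ("10.0.0.1", "d2"), ("10.0.0.1", "d3"),
    ("10.0.0.2", "d1"), ("10.0.0.2", "d2"), ("10.0.0.2", "d3"),
    ("10.0.0.3", "d1"), ("10.0.0.3", "d2"),
    ("10.0.0.4", "d8"), ("10.0.0.4", "d9"),
    ("10.0.0.5", "d8"), ("10.0.0.5", "d9"),
    ("10.0.0.6", "d7"),
]
graph = build_similarity_graph(flows)
labels = label_propagation(graph)

communities = defaultdict(set)
for host, label in labels.items():
    communities[label].add(host)
```

In this toy input the first three hosts share most of their destinations and end up with one label, the next two form a second community, and the last host, which shares no destination with any other, stays alone. An isolated or very small community of this kind is the sort of result that could be inspected further when looking for unusual behavior.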
