Data summarization for network traffic monitoring

Network traffic monitoring is difficult because even small networks generate large volumes of traffic. One way to make the task tractable is network traffic summarization. Data summarization is a key concept in data mining, yet no established measures exist for evaluating summaries. This paper presents four metrics that characterize data summarization results. Conciseness and Information Loss have already been defined; we modify Information Loss because the existing definition is biased towards attributes that recur across individual summaries. We also propose two additional metrics, Interestingness and Intelligibility. Using the proposed metrics, we evaluate existing summarization techniques on well-known network traffic datasets. We also propose a summarization technique that builds on an existing one but uses the proposed metrics as its objective function. To further demonstrate the usability of the metrics, we perform classification on summarized datasets, showing that the metrics can guide the selection of summaries for data mining. On summarized datasets with reasonable conciseness, we achieve similar accuracy at a fraction of the running time, with the running time proportional to the conciseness of the summarized dataset.
