Network Traffic Classification Using K-means Clustering

Network traffic classification and application identification are important for IP network engineering, management, control, and other key domains. Popular current methods, such as port-based and payload-based classification, have shown disadvantages, and machine learning based methods, which classify traffic according to payload-independent statistical characteristics, are a promising alternative. This paper introduces the different levels of network traffic analysis and the relevant background in machine learning, and analyzes the problems of port-based and payload-based methods in traffic classification. Considering the advantages of the machine learning based approach, we experiment with unsupervised K-means to evaluate its efficiency and performance. We adopt feature selection to find an optimal feature set and a log transformation to improve accuracy. Experimental results on different datasets show that the method obtains up to 80% overall accuracy and that, after a log transformation, accuracy improves to 90% or more.
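
The pipeline summarized above (log transformation, K-means clustering, then scoring by overall accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the two flow features (total bytes, duration in seconds), the toy flows, the class names, and the rule of labeling each cluster by its majority class are all assumptions made for illustration, and feature selection is omitted.

```python
import math
import random
from collections import Counter

def log_transform(rows):
    """Apply log(1 + x) to every feature to compress heavy-tailed values."""
    return [[math.log1p(v) for v in row] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=100, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows
        ]
        if new_assign == assign:  # assignments are stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def overall_accuracy(assign, labels):
    """Label each cluster by its majority class, then score all flows."""
    correct = 0
    for c in set(assign):
        members = [l for a, l in zip(assign, labels) if a == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

# Hypothetical flows: [total_bytes, duration_seconds]; labels are used only
# for evaluation, never for clustering.
flows = [[1_200_000, 30.0], [900_000, 25.0], [1_500_000, 40.0],
         [800, 0.2], [1_200, 0.3], [600, 0.1]]
labels = ["bulk", "bulk", "bulk", "web", "web", "web"]

clusters = kmeans(log_transform(flows), k=2)
print(overall_accuracy(clusters, labels))
```

The log transformation matters here because raw byte counts span several orders of magnitude and would otherwise dominate the Euclidean distance; after `log1p`, both features contribute on comparable scales.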
