Efficient Mining of the Multidimensional Traffic Cluster Hierarchy for Digesting, Visualization, and Anomaly Identification

Mining traffic to identify the dominant flows sent over a given link, over a specified time interval, is a valuable capability with applications to traffic auditing, simulation, visualization, as well as anomaly detection. Recently, Estan advanced a comprehensive data mining structure tailored for networking data-a parsimonious, multidimensional flow hierarchy, along with an algorithm for its construction. While they primarily targeted offline auditing, use in interactive traffic visualization and anomaly/attack detection will require real-time data mining. We suggest several improvements to Estan's algorithm that substantially reduce the computational complexity of multidimensional flow mining. We also propose computational and memory-efficient approaches for unidimensional clustering of the IP address spaces. For baseline implementations, evaluated on the New Zealand (NZIX) trace data, our method reduced CPU execution times of the Estan method by a factor of more than eight. We also develop a methodology for anomaly/attack detection based on flow mining, demonstrating the usefulness of this approach on traces from the Slammer and Code Red worms and the MIT Lincoln Laboratories DDoS data

[1]  Noga Alon,et al.  The Space Complexity of Approximating the Frequency Moments , 1999 .

[2]  Jiawei Han,et al.  Discovery of Multiple-Level Association Rules from Large Databases , 1995, VLDB.

[3]  Dan Schnackenberg,et al.  Statistical approaches to DDoS attack detection and response , 2003, Proceedings DARPA Information Survivability Conference and Exposition.

[4]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[5]  George Varghese,et al.  Automated Worm Fingerprinting , 2004, OSDI.

[6]  Zhuoqing Morley Mao,et al.  Toward understanding distributed blackhole placement , 2004, WORM '04.

[7]  Ke Wang,et al.  Mining frequent item sets by opportunistic projection , 2002, KDD.

[8]  B. Karp,et al.  Autograph: Toward Automated, Distributed Worm Signature Detection , 2004, USENIX Security Symposium.

[9]  Abhishek Kumar,et al.  Data streaming algorithms for efficient and accurate estimation of flow size distribution , 2004, SIGMETRICS '04/Performance '04.

[10]  ZhangZhi-Li,et al.  Profiling internet backbone traffic , 2005 .

[11]  Stefan Savage,et al.  The Spread of the Sapphire/Slammer Worm , 2003 .

[12]  Divesh Srivastava,et al.  Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data , 2004, SIGMOD '04.

[13]  Carsten Lund,et al.  Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications , 2004, IMC '04.

[14]  John Heidemann,et al.  Detecting Early Worm Propagation through Packet Matching , 2004 .

[15]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[16]  Hari Balakrishnan,et al.  Fast portscan detection using sequential hypothesis testing , 2004, IEEE Symposium on Security and Privacy, 2004. Proceedings. 2004.

[17]  Mark Crovella,et al.  Mining anomalies using traffic feature distributions , 2005, SIGCOMM '05.

[18]  John Heidemann,et al.  On the correlation of Internet flow characteristics , 2003 .

[19]  Carsten Lund,et al.  Charging from sampled network usage , 2001, IMW '01.

[20]  James Newsome,et al.  Polygraph: automatically generating signatures for polymorphic worms , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[21]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[22]  Zhi-Li Zhang,et al.  Profiling internet backbone traffic: behavior models and applications , 2005, SIGCOMM '05.

[23]  Rajeev Motwani,et al.  Approximate Frequency Counts over Data Streams , 2012, VLDB.

[24]  Vern Paxson,et al.  Proceedings of the 13th USENIX Security Symposium , 2022 .

[25]  Graham Cormode,et al.  What's new: finding significant differences in network data streams , 2004, INFOCOM 2004.

[26]  George Varghese,et al.  Automatically inferring patterns of resource consumption in network traffic , 2003, SIGCOMM '03.

[27]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[28]  DiotChristophe,et al.  Mining anomalies using traffic feature distributions , 2005 .