Modeling skew in data streams

Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges of model fitting at streaming speeds are both technical (how to continually find fast and reliable parameter estimates on high-speed streams of skewed data using small space) and conceptual (how to validate the goodness-of-fit and stability of the model online).

In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast and space-efficient, and that approximate the parameter estimates obtained offline with a priori quality guarantees. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot to perform online model validation at streaming speeds.

As a concrete application of our techniques, we focus on network traffic data, which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope® data stream management system to demonstrate the practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application and demonstrate the potential of online parametric modeling.
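To make the idea of fitting a power-law model with a least-squares straight line concrete, the following is a minimal offline sketch, not the streaming algorithm of the paper. It assumes a Pareto model with complementary CDF (x_min / x)^alpha, so that log CCDF is a straight line in log x with slope -alpha, and it checks the fit with a q-q comparison on a log scale. All function names, the midpoint plotting positions, and the synthetic sample are illustrative assumptions; a streaming version would replace the sorted sample with quantiles estimated from a sketch.

```python
import math

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pareto_quantile(u, alpha, x_min):
    """Inverse CDF of a Pareto distribution with shape alpha and scale x_min."""
    return x_min * (1.0 - u) ** (-1.0 / alpha)

def fit_pareto(values):
    """Estimate (alpha, x_min) from the line fitted to log CCDF versus log value."""
    ordered = sorted(values)
    n = len(ordered)
    log_x = [math.log(v) for v in ordered]
    # Midpoint plotting positions (an assumption) keep the empirical CCDF positive.
    log_ccdf = [math.log((n - k - 0.5) / n) for k in range(n)]
    slope, intercept = least_squares_line(log_x, log_ccdf)
    alpha = -slope
    # Under the model, intercept = alpha * log(x_min).
    return alpha, math.exp(intercept / alpha)

def qq_line(values, alpha, x_min):
    """Fit a line to log empirical quantiles versus log model quantiles.

    A slope near 1 and an intercept near 0 suggest the model fits well.
    """
    ordered = sorted(values)
    n = len(ordered)
    model = [math.log(pareto_quantile((k + 0.5) / n, alpha, x_min)) for k in range(n)]
    empirical = [math.log(v) for v in ordered]
    return least_squares_line(model, empirical)

# Synthetic, deterministic sample: exact Pareto quantiles with alpha = 1.5, x_min = 1.
n = 1000
sample = [pareto_quantile((k + 0.5) / n, 1.5, 1.0) for k in range(n)]

alpha_hat, x_min_hat = fit_pareto(sample)
qq_slope, qq_intercept = qq_line(sample, alpha_hat, x_min_hat)
print(alpha_hat, x_min_hat, qq_slope, qq_intercept)
```

Because the synthetic sample lies exactly on the model quantiles, the fitted shape recovers the generating value up to floating-point error; on real traffic data the q-q line would deviate from slope 1 wherever the Pareto assumption fails.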

[1]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[2]  Edith Cohen,et al.  Maintaining time-decaying stream aggregates , 2006, J. Algorithms.

[3]  R. E. Wheeler Statistical distributions , 1983, APLQ.

[4]  V. Paxson,et al.  WHERE MATHEMATICS MEETS THE INTERNET , 1998 .

[5]  M. Evans Statistical Distributions , 2000 .

[6]  Sidney I. Resnick,et al.  Heavy Tail Modelling and Teletraffic Data , 1995 .

[7]  Divesh Srivastava,et al.  On computing correlated aggregates over continual data streams , 2001, SIGMOD '01.

[8]  Christos Faloutsos,et al.  Fast estimation of fractal dimension and correlation integral on stream data , 2005, Inf. Process. Lett..

[9]  Walter Willinger,et al.  Self-similarity and heavy tails: structural modeling of network traffic , 1998 .

[10]  Philippe Flajolet,et al.  Probabilistic counting , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[11]  Graham Cormode,et al.  Summarizing and Mining Skewed Data Streams , 2005, SDM.

[12]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[13]  Divesh Srivastava,et al.  Holistic UDAFs at streaming speeds , 2004, SIGMOD '04.

[14]  Balachander Krishnamurthy,et al.  ATMEN: a triggered network measurement infrastructure , 2005, WWW '05.

[15]  V. Plerou,et al.  A theory of power-law distributions in financial market fluctuations , 2003, Nature.

[16]  Eddie Kohler,et al.  Observed Structure of Addresses in IP Traffic , 2002, IEEE/ACM Transactions on Networking.

[17]  Christos Faloutsos,et al.  Modeling Skewed Distribution Using Multifractals and the '80-20' Law , 1996, VLDB.

[18]  Benoit B. Mandelbrot,et al.  Fractals and Scaling in Finance , 1997 .

[19]  Matthew Roughan,et al.  Pragmatic modeling of broadband access traffic , 2003, Comput. Commun..

[20]  Walter Willinger,et al.  A pragmatic approach to dealing with high-variability in network measurements , 2004, IMC '04.

[21]  Azer Bestavros,et al.  Changes in Web client access patterns: Characteristics and caching implications , 1999, World Wide Web.

[22]  Christos Faloutsos,et al.  Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension , 1995, VLDB.

[23]  Christos Faloutsos,et al.  Data mining meets performance evaluation: fast algorithms for modeling bursty traffic , 2002, Proceedings 18th International Conference on Data Engineering.

[24]  Christos Faloutsos,et al.  The "DGX" distribution for mining massive, skewed data , 2001, KDD '01.

[25]  Anja Feldmann,et al.  Scaling Analysis of Conservative Cascades, with Applications to Network Traffic , 1999, IEEE Trans. Inf. Theory.

[26]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[27]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[28]  Feifei Li,et al.  Characterizing and Exploiting Reference Locality in Data Stream Applications , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[29]  Piotr Indyk,et al.  Maintaining stream statistics over sliding windows: (extended abstract) , 2002, SODA '02.

[30]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[31]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[32]  Manfred Schroeder,et al.  Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise , 1992 .

[33]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[34]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[35]  Christos Faloutsos,et al.  Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension , 1994, PODS.

[36]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..