Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams

Data streams are dynamic, with frequent distributional changes. In this paper, we propose a statistical approach to detecting distributional shifts in multi-dimensional data streams. We use relative entropy, also known as the Kullback-Leibler distance, to measure the statistical distance between two distributions. In the context of a multi-dimensional data stream, the distributions are generated by data from two sliding windows. We maintain a sample of the data from the stream inside the windows to build the distributions. Our algorithm is streaming, nonparametric, and requires no distributional or model assumptions. It employs the statistical theory of hypothesis testing and bootstrapping to determine whether the distributions are statistically different. We provide a full suite of experiments on synthetic data to validate the method and demonstrate its effectiveness on data from real-life applications.
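The test described above can be illustrated with a minimal sketch. The abstract does not specify how the distributions are built from the windows, how empty cells are handled, or how the bootstrap resamples are drawn, so everything below is an assumption made for illustration: a fixed equal-width grid over [0, 1]^d, additive smoothing of 0.5 per cell, and a bootstrap that resamples from the pooled windows to approximate the no-change distribution of the relative entropy. The function names (`histogram`, `kl_divergence`, `bootstrap_threshold`, `detect_shift`) are hypothetical, and the two windows are passed in as plain lists rather than maintained as samples over a live stream.

```python
import math
import random
from collections import Counter


def histogram(points, bins_per_dim, lo=0.0, hi=1.0):
    """Count points per grid cell; each point is a tuple of floats in [lo, hi]."""
    width = (hi - lo) / bins_per_dim
    counts = Counter()
    for point in points:
        # Clamp indices so values on the boundary fall into a valid cell.
        cell = tuple(
            max(0, min(int((x - lo) / width), bins_per_dim - 1)) for x in point
        )
        counts[cell] += 1
    return counts


def kl_divergence(p_counts, q_counts, n_cells, smoothing=0.5):
    """Relative entropy D(P || Q) in nats between two smoothed histograms."""
    p_total = sum(p_counts.values()) + smoothing * n_cells
    q_total = sum(q_counts.values()) + smoothing * n_cells
    occupied = set(p_counts) | set(q_counts)
    divergence = 0.0
    for cell in occupied:
        p = (p_counts.get(cell, 0) + smoothing) / p_total
        q = (q_counts.get(cell, 0) + smoothing) / q_total
        divergence += p * math.log(p / q)
    # Cells that are empty in both windows still carry smoothed mass.
    n_empty = n_cells - len(occupied)
    if n_empty:
        p = smoothing / p_total
        q = smoothing / q_total
        divergence += n_empty * p * math.log(p / q)
    return divergence


def bootstrap_threshold(reference, current, bins_per_dim, alpha, n_boot, rng):
    """Estimate the (1 - alpha) quantile of the divergence under no change."""
    pooled = list(reference) + list(current)
    n_cells = bins_per_dim ** len(pooled[0])
    stats = []
    for _ in range(n_boot):
        # Assumption: under the null hypothesis both windows come from the
        # pooled sample, so both resamples are drawn from it with replacement.
        a = rng.choices(pooled, k=len(reference))
        b = rng.choices(pooled, k=len(current))
        stats.append(
            kl_divergence(
                histogram(a, bins_per_dim), histogram(b, bins_per_dim), n_cells
            )
        )
    stats.sort()
    return stats[min(n_boot - 1, int((1 - alpha) * n_boot))]


def detect_shift(reference, current, bins_per_dim=4, alpha=0.05, n_boot=200, seed=0):
    """Return (shift_detected, observed_divergence, threshold)."""
    rng = random.Random(seed)
    n_cells = bins_per_dim ** len(reference[0])
    observed = kl_divergence(
        histogram(reference, bins_per_dim),
        histogram(current, bins_per_dim),
        n_cells,
    )
    threshold = bootstrap_threshold(
        reference, current, bins_per_dim, alpha, n_boot, rng
    )
    return observed > threshold, observed, threshold
```

Calling `detect_shift(reference_window, current_window)` on two lists of equal-dimension tuples returns whether the observed relative entropy exceeds the bootstrap threshold at significance level `alpha`. A fixed grid scales poorly with dimension, so a space-partitioning structure would be a more realistic choice for high-dimensional streams; the grid is used here only to keep the sketch short.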
