Data streaming algorithms for the Kolmogorov-Smirnov test

We propose space-efficient algorithms for performing the Kolmogorov-Smirnov test on streaming data. The Kolmogorov-Smirnov test is a non-parametric test for measuring the strength of the hypothesis that some data is drawn from a fixed distribution (one-sample test), or that two sets of data are drawn from the same distribution (two-sample test). Unlike some other tests, the Kolmogorov-Smirnov test does not assume that the distribution has a known form (e.g., that it is normal), and in the two-sample case it requires no knowledge of the distribution other than that it is continuous. Motivated by the challenges of big data, we present algorithms for both the one-sample and the two-sample tests for data processed as a stream. We demonstrate the accuracy of our algorithms via extensive experimentation on both real and synthetic datasets. We show that our algorithms are superior to sampling and that they accurately perform the test with a reduction in data of several orders of magnitude.
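For readers unfamiliar with the test, the following is a minimal sketch of the exact, non-streaming Kolmogorov-Smirnov statistics that a streaming algorithm would need to approximate. It is not the algorithm proposed in this paper: it stores and sorts all of the data, which is precisely what a space-efficient method avoids. The function names `ks_one_sample_statistic` and `ks_two_sample_statistic` are hypothetical, chosen here for illustration only.

```python
from bisect import bisect_right


def ks_one_sample_statistic(xs, cdf):
    """Exact one-sample KS statistic D = sup_x |F_n(x) - F(x)|.

    `cdf` is the hypothesized continuous distribution function F.
    Illustrative baseline only: it keeps all n points in memory.
    """
    if not xs:
        raise ValueError("sample must be non-empty")
    a = sorted(xs)
    n = len(a)
    d = 0.0
    for i, v in enumerate(a, start=1):
        f = cdf(v)
        # The empirical CDF jumps from (i-1)/n to i/n at the i-th order
        # statistic, so the gap is checked on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_two_sample_statistic(xs, ys):
    """Exact two-sample KS statistic D = sup_x |F_xs(x) - F_ys(x)|.

    Both empirical CDFs are right-continuous step functions, so the
    supremum is attained at one of the observed data points.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(xs), sorted(ys)
    n, m = len(a), len(b)
    d = 0.0
    for v in a + b:
        fa = bisect_right(a, v) / n  # fraction of xs that are <= v
        fb = bisect_right(b, v) / m  # fraction of ys that are <= v
        d = max(d, abs(fa - fb))
    return d
```

A large value of D is evidence against the hypothesis that the data follow the reference distribution (one-sample) or that the two samples share a distribution (two-sample). This baseline needs memory linear in the input size, which is the cost that motivates a streaming approach.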
