Accessible Streaming Algorithms for the Chi-Square Test

We present space-efficient algorithms for performing Pearson’s chi-square goodness-of-fit test in a streaming setting. Since the chi-square test is one of the best-known and most widely used tests in statistics, it is surprising that there has been no prior work on designing streaming algorithms for it. The test makes no assumption about the form of the underlying distribution and has one-sample and two-sample variants. Given a stream of data, the one-sample variant tests whether the stream is drawn from a fixed distribution. The two-sample variant tests whether two data streams are drawn from the same or similar distributions. One major advantage of statistical tests over other quantities commonly measured by streaming algorithms is that these tests require no parameter tuning and produce results that data analysts can easily interpret. This paper addresses how to compute the chi-square test on streams with minimal parameter configuration and assumptions. We prove that, in the continuous case, the chi-square statistic can be computed with high fidelity and an almost quadratic reduction in memory, whereas the categorical case admits only heuristic solutions. We validate the performance and accuracy of our algorithms through extensive testing on both real and synthetic data sets.
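To make the two variants concrete, the following minimal sketch computes the exact Pearson chi-square statistic for categorical data in one pass over each stream. It is a baseline for illustration only, not the space-efficient algorithms of this paper: it keeps one counter per category, so its memory grows with the number of categories. The function names and the `expected_probs` parameter are hypothetical, and the two-sample formula is the standard textbook form for samples of unequal size.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Hashable, Iterable


def chi_square_one_sample(stream: Iterable[Hashable],
                          expected_probs: Dict[Hashable, float]) -> float:
    """Exact one-sample statistic: sum over categories of (O - E)^2 / E."""
    counts: Counter = Counter()
    n = 0
    for item in stream:  # single pass over the stream
        if item not in expected_probs:
            raise ValueError(f"unexpected category: {item!r}")
        counts[item] += 1
        n += 1
    if n == 0:
        raise ValueError("empty stream")
    stat = 0.0
    for category, p in expected_probs.items():
        expected = n * p             # expected count under the fixed distribution
        observed = counts[category]  # Counter returns 0 for unseen categories
        stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_two_sample(stream_a: Iterable[Hashable],
                          stream_b: Iterable[Hashable]) -> float:
    """Exact two-sample statistic for two categorical streams."""
    a, b = Counter(stream_a), Counter(stream_b)
    n_a, n_b = sum(a.values()), sum(b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("empty stream")
    # Scaling factors compensate for streams of different lengths.
    k_a, k_b = sqrt(n_b / n_a), sqrt(n_a / n_b)
    stat = 0.0
    for category in set(a) | set(b):
        # Every category in the union has a positive combined count.
        stat += (k_a * a[category] - k_b * b[category]) ** 2 / (a[category] + b[category])
    return stat
```

For example, a stream of 60 heads and 40 tails tested against a fair coin (`expected_probs = {"H": 0.5, "T": 0.5}`) gives a one-sample statistic of 4.0, and two identical streams give a two-sample statistic of 0.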
