Sampling Correctors

In many situations, sample data is obtained from a noisy or imperfect source. To address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms, as well as to achieve improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether sampling correctors can be more efficient in terms of sample complexity than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when the corrector is also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at distance O(1/log^2 n)). In addition, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than those achievable by the learning approach. We then consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process. We show that for correcting close-to-uniform distributions and close-to-monotone distributions, no additional source of random bits is required, as the samples from the input source itself can be used to produce this randomness.
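
To make the learning-based approach concrete, the following is a minimal sketch, not the paper's construction. It assumes a finite domain {0, ..., n-1} and takes "monotone" to mean non-increasing. As a stand-in for a proper learner, it fits the empirical distribution and then averages adjacent violators so that the fitted probabilities become non-increasing; corrected samples are then drawn from the fitted distribution. The function names (empirical_pmf, project_non_increasing, make_corrector) are hypothetical, and the sketch makes no claim about sample complexity.

```python
import random
from collections import Counter


def empirical_pmf(samples, n):
    """Empirical probabilities over the domain {0, ..., n-1}."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(n)]


def project_non_increasing(pmf):
    """Pool adjacent violators: return a non-increasing sequence with the
    same total mass (a stand-in for a proper learner of monotone
    distributions, assumed here for illustration)."""
    blocks = []  # each block is [total mass, number of points]
    for p in pmf:
        blocks.append([p, 1])
        # Merge while the previous block's mean is below the last block's
        # mean, since that would violate the non-increasing constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, length = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += length
    fitted = []
    for total, length in blocks:
        fitted.extend([total / length] * length)
    return fitted


def make_corrector(noisy_samples, n, rng):
    """Learn a non-increasing hypothesis from noisy samples and return it
    together with a function that draws corrected samples from it."""
    fitted = project_non_increasing(empirical_pmf(noisy_samples, n))
    support = list(range(n))

    def draw():
        return rng.choices(support, weights=fitted, k=1)[0]

    return fitted, draw


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = [0, 1, 1, 2, 0, 3, 1, 0, 1, 1]  # toy data, slightly non-monotone
    fitted, draw = make_corrector(noisy, n=4, rng=rng)
    print(fitted)
    print([draw() for _ in range(5)])
```

Averaging adjacent violators preserves the total mass and keeps every value non-negative, so the fitted sequence is itself a probability distribution. Note that this sketch learns the whole distribution first, which is exactly the cost that the more efficient correctors described above aim to avoid.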
