Sampling Correctors

In many situations, sample data is obtained from a noisy or imperfect source. To address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be used to expand the applicability of known distribution learning and property testing algorithms, as well as to obtain improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether sampling correctors can be more efficient in terms of sample complexity than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when the algorithm is also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at a distance O(1/log^2 n)). In addition, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than those achievable by the learning approach. We then consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process. We show that for correcting close-to-uniform distributions and close-to-monotone distributions, no additional source of random bits is required, as the samples from the input source itself can be used to produce this randomness.
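
The learning-based approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a domain {0, ..., n-1}, takes "monotone" to mean non-increasing, and uses a pool-adjacent-violators projection of the empirical distribution as a stand-in for a proper learner. All names here (pava_non_increasing, learn_monotone, make_corrector) are hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def pava_non_increasing(values):
    """Least-squares projection of `values` onto non-increasing sequences
    (pool-adjacent-violators). The total sum is preserved."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the earlier block's mean is smaller than the later one's.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def learn_monotone(samples, n):
    """Stand-in proper learner (an assumption of this sketch): the hypothesis
    is itself a non-increasing distribution over {0, ..., n-1}."""
    counts = Counter(samples)
    m = len(samples)
    empirical = [counts.get(i, 0) / m for i in range(n)]
    return pava_non_increasing(empirical)


def make_corrector(noisy_sampler, n, num_learning_samples, rng):
    """Learn a monotone hypothesis from noisy samples, then answer every
    later request by sampling from that hypothesis."""
    samples = [noisy_sampler() for _ in range(num_learning_samples)]
    hypothesis = learn_monotone(samples, n)
    cdf = list(accumulate(hypothesis))

    def corrected_sample():
        # Inverse-CDF sampling from the learned hypothesis.
        u = rng.random() * cdf[-1]
        return min(bisect_right(cdf, u), n - 1)

    return corrected_sample, hypothesis
```

A caller would wrap the noisy source once, for example `corrected_sample, hyp = make_corrector(noisy, n, 2000, random.Random(0))`, and then draw from `corrected_sample()` in place of the source. In this sketch the entire sample cost is paid up front in the learning phase, which is the baseline against which the abstract's more sample-efficient correctors are compared.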
