Population recovery and partial identification

We study several problems in which an unknown distribution over an unknown population of vectors needs to be recovered from partial or noisy samples, each of which nearly completely erases or obliterates the original vector. For example, consider a distribution $p$ over a population $V \subseteq \{0,1\}^n$. A noisy sample $v'$ is obtained by choosing $v$ according to $p$ and flipping each coordinate of $v$ independently with probability, say, 0.49. The problem is to recover $V, p$ as efficiently as possible from noisy samples. Such problems naturally arise in a variety of contexts in learning, clustering, statistics, computational biology, data mining and database privacy, where loss and error may be introduced by nature, inaccurate measurements, or on purpose. We give fairly efficient algorithms to recover the data under fairly general assumptions. Underlying our algorithms is a new structure we call a partial identification (PID) graph for an arbitrary finite set of vectors over any alphabet. This graph captures the extent to which certain subsets of coordinates in each vector distinguish it from other vectors. PID graphs yield strategies for dimension reductions and re-assembly of statistical information, and so may be useful in other applications as well. The quality of our algorithms (sequential and parallel runtime, as well as numerical stability) critically depends on three parameters of PID graphs: width, depth and cost. The combinatorial heart of this work is showing that every set of vectors possesses a PID graph in which all three parameters are small (we prove some limitations on their trade-offs as well). We further give an efficient algorithm to find such near-optimal PID graphs for any set of vectors. Our efficient PID graphs imply general algorithms for these recovery problems, even when loss or noise are just below the information-theoretic limit. In the learning/clustering context this gives a new algorithm for learning mixtures of binomial distributions (with known marginals) whose running time depends only quasi-polynomially on the number of clusters. We discuss implications to privacy and coding as well.
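
To make the sampling model concrete, here is a minimal Python sketch of the noisy-sample generator described above. The population `V`, the distribution `p`, and the flip probability `flip_prob = 0.49` are illustrative choices for this sketch, not data or parameters taken from the paper; the recovery algorithms themselves (based on PID graphs) are not reproduced here.

```python
import random

def noisy_sample(V, p, flip_prob=0.49):
    """Draw one noisy sample: pick a vector v from V according to the
    distribution p, then flip each coordinate independently with
    probability flip_prob (just below the information-theoretic limit of 1/2)."""
    v = random.choices(V, weights=p, k=1)[0]
    return tuple(bit ^ (random.random() < flip_prob) for bit in v)

# Illustrative population of 3-bit vectors and a distribution over it.
V = [(0, 0, 0), (1, 0, 1), (1, 1, 1)]
p = [0.5, 0.3, 0.2]

# The recovery task is to estimate V and p from many such corrupted samples.
samples = [noisy_sample(V, p) for _ in range(5)]
print(samples)
```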
