Decoding from Pooled Data: Sharp Information-Theoretic Bounds

Consider a population consisting of n individuals, each of whom has one of d types (e.g. their blood type, in which case d = 4). We are allowed to query this database by specifying a subset of the population, and in response we observe a noiseless histogram (a d-dimensional vector of counts) of the types of the pooled individuals. This measurement model arises in practical situations such as the pooling of genetic data and may also be motivated by privacy considerations. We are interested in the number of queries needed to unambiguously determine the type of each individual. In this paper, we study this information-theoretic question in the random, dense setting where each query is a random subset of individuals of size proportional to n. This makes the problem a particular example of a random constraint satisfaction problem (CSP) with a "planted" solution. We establish almost matching upper and lower bounds on the minimum number of queries m such that, with probability tending to 1 as n → ∞, there is no solution other than the planted one. Our proof relies on the computation of the exact "annealed free energy" of this model in the thermodynamic limit, which corresponds to the exponential rate of decay of the expected number of solutions to this planted CSP. As a by-product of the analysis, we show an identity of independent interest relating the Gaussian integral over the space of Eulerian flows of a graph to its spanning tree polynomial.
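The measurement model can be illustrated with a small brute-force sketch. It is not the paper's method, only a toy under stated assumptions: types are encoded as integers 0..d-1, each pool is a uniformly random subset of fixed size round(n/2) (one possible reading of "size proportional to n"), and all function names are hypothetical. It counts how many assignments are consistent with the observed histograms, which is feasible only for very small n.

```python
import itertools
import random


def histogram(assignment, pool, d):
    """Return the d-dimensional vector of type counts for the pooled individuals."""
    counts = [0] * d
    for i in pool:
        counts[assignment[i]] += 1
    return tuple(counts)


def random_pools(n, m, rng, fraction=0.5):
    """Draw m pools, each a uniformly random subset of size round(fraction * n).

    The fixed fraction is an assumption made for this illustration.
    """
    size = max(1, round(fraction * n))
    return [rng.sample(range(n), size) for _ in range(m)]


def count_consistent(n, d, pools, observed):
    """Count assignments in {0..d-1}^n that reproduce every observed histogram.

    Exhaustive search over d**n candidates, so only usable for tiny n.
    """
    total = 0
    for candidate in itertools.product(range(d), repeat=n):
        if all(histogram(candidate, p, d) == h for p, h in zip(pools, observed)):
            total += 1
    return total


def solutions_for_planted(n, d, m, seed=0):
    """Plant a random assignment, query m random pools, count consistent solutions."""
    rng = random.Random(seed)
    planted = tuple(rng.randrange(d) for _ in range(n))
    pools = random_pools(n, m, rng)
    observed = [histogram(planted, p, d) for p in pools]
    return planted, count_consistent(n, d, pools, observed)
```

The planted assignment always satisfies every constraint, so the count is at least 1; the planted solution is uniquely determined exactly when the count equals 1. With m = 0 there are no constraints and all d^n assignments are consistent.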
