The problem of extracting categorical data via noisy histogram queries is investigated. The data set considered is a collection of $n$ items, each of which carries a piece of categorical data taking values in a finite alphabet. Data analysts may query the data set through a curator by specifying a subset of items and obtaining the histogram of the queried subset. The (unnormalized) histogram released by the curator, however, is perturbed by additive noise with maximum magnitude $\delta_n$. The goal of the data analyst is to reconstruct the categorical data set such that the Hamming distance between the reconstructed data set and the actual one is smaller than a tolerance parameter $k_n$. In this work, we explore the fundamental limit on the minimum number of queries $T_n^*$ required for the analyst to reconstruct the $n$-item data set within tolerance $k_n$ subject to noisy perturbation of magnitude $\delta_n$. We first show that if $\delta_n = O(\sqrt{k_n})$, the minimum query complexity is $T_n^* = \Theta(n / \log n)$, where the achievability is based on random sampling and the converse is based on counting and packing arguments. On the other hand, if $\delta_n = \Omega(k_n^{(1+\varepsilon)/2})$ for some $\varepsilon > 0$, we prove that $T_n^* = \omega(n^p)$ for any positive integer $p$. In other words, no querying method with polynomial-in-$n$ query complexity can successfully reconstruct the data set in that regime. This impossibility result is established by a novel combinatorial lower bound on $T_n^*$.
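To make the query model concrete, the following is a minimal Python sketch of a curator answering one histogram query with bounded additive noise, together with the Hamming distance used to judge a reconstruction. It illustrates the setting only and is not the paper's reconstruction algorithm. The function names, the uniform noise distribution, and the rule of including each item with probability 1/2 are assumptions made for illustration; the abstract requires only that the noise magnitude be at most $\delta_n$ and says only that achievability uses random sampling.

```python
import random
from collections import Counter


def noisy_histogram(data, subset, alphabet, delta, rng):
    """Unnormalized histogram of the queried items, each count perturbed
    by additive noise of magnitude at most delta.
    Uniform noise is an assumption; the model only bounds the magnitude."""
    counts = Counter(data[i] for i in subset)
    return {a: counts.get(a, 0) + rng.uniform(-delta, delta) for a in alphabet}


def hamming_distance(x, y):
    """Number of positions at which two equal-length data sets differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def random_query(n, rng):
    """Hypothetical sampling rule: include each item independently w.p. 1/2."""
    return [i for i in range(n) if rng.random() < 0.5]


rng = random.Random(0)
alphabet = ["a", "b", "c"]
data = [rng.choice(alphabet) for _ in range(20)]  # the hidden n-item data set
subset = random_query(len(data), rng)             # one query chosen by the analyst

exact = noisy_histogram(data, subset, alphabet, 0.0, rng)  # noiseless answer
noisy = noisy_histogram(data, subset, alphabet, 1.5, rng)  # delta = 1.5
```

In this sketch, a reconstruction `guess` would count as successful when `hamming_distance(guess, data)` falls below the tolerance $k_n$.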