Efficient Statistics, in High Dimensions, from Truncated Samples

We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy, the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a d-variate normal N(mu, Sigma) are samples that are only revealed if they fall in some subset S of the d-dimensional Euclidean space; otherwise the samples are hidden, and so is their count relative to the revealed samples. We show that the mean mu and covariance matrix Sigma can be estimated with arbitrary accuracy in polynomial time, as long as we have oracle access to S and S has non-trivial measure under the unknown d-variate normal distribution. Additionally, we show that without oracle access to S, no non-trivial estimation is possible.
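
To make the access model concrete, below is a minimal, self-contained sketch in Python. It is an illustration, not the paper's algorithm: it treats S purely as a membership oracle, draws truncated samples by rejection, and runs stochastic gradient descent on the truncated-normal negative log-likelihood to recover the mean, with the covariance fixed to the identity for brevity (the paper's full result recovers Sigma as well). The particular set oracle_S, the step sizes, and the sample counts are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm): MLE for the mean of a
# truncated d-variate normal N(mu, I), given only (i) truncated samples
# and (ii) membership-oracle access to the truncation set S.

rng = np.random.default_rng(0)
d = 3
true_mu = np.array([1.0, -0.5, 2.0])

def oracle_S(x):
    # Hypothetical truncation set for illustration: all coordinates > -1.
    return bool(np.all(x > -1.0))

def sample_truncated(mu, n):
    # Rejection sampling: draw from N(mu, I), keep only points the oracle
    # says lie in S. Efficient whenever S has non-trivial measure under
    # N(mu, I), mirroring the assumption in the abstract.
    out = []
    while len(out) < n:
        x = rng.normal(mu, 1.0, size=d)
        if oracle_S(x):
            out.append(x)
    return np.array(out)

# Observed data: truncated samples from the unknown normal.
data = sample_truncated(true_mu, 2000)

# SGD on the truncated-normal negative log-likelihood. With Sigma = I,
#   grad_mu NLL(x) = E[y | y ~ N(mu, I), y in S] - x,
# so (y - x) with a fresh model sample y is an unbiased stochastic gradient.
mu_hat = np.zeros(d)
step = 0.1
for t in range(3000):
    x = data[rng.integers(len(data))]
    y = sample_truncated(mu_hat, 1)[0]   # model sample, drawn via the oracle
    mu_hat -= step / np.sqrt(t + 1) * (y - x)

print("true mu     :", true_mu)
print("estimated mu:", mu_hat)
```

Note where the oracle enters: the stochastic gradient needs a fresh sample y from the current model restricted to S, which can only be produced by querying membership in S. Without that access, neither the likelihood nor its gradient can be evaluated, which is consistent with the abstract's impossibility claim. A provably convergent version would also require a projection step to keep the iterates in a region where S retains non-trivial measure; this sketch omits that.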
