Efficient Truncated Statistics with Unknown Truncation

We study the problem of estimating the parameters of a Gaussian distribution when samples are only shown if they fall in some (unknown) set. This core problem in truncated statistics has a long history going back to Galton, Lee, Pearson, and Fisher. Recent work by Daskalakis et al. (FOCS'18) provides the first efficient algorithm that works for arbitrary sets in high dimension when the set is known, but leaves open the more challenging and relevant case of an unknown truncation set. Our main result is a computationally and sample-efficient algorithm for estimating the parameters of the Gaussian under an arbitrary unknown truncation set, whose performance degrades with a natural measure of the complexity of the set, namely its Gaussian surface area. Notably, the algorithm handles large families of sets, including intersections of halfspaces, polynomial threshold functions, and general convex sets. We show that our algorithm closely captures the tradeoff between the complexity of the set and the number of samples needed to learn the parameters, by exhibiting a set with small Gaussian surface area for which it is information-theoretically impossible to learn the true Gaussian from few samples.
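A minimal sketch of the phenomenon the abstract describes, not the paper's algorithm: samples from a one-dimensional Gaussian are kept only when they land in a truncation set (here assumed, for illustration, to be the halfline $(0, \infty)$). The naive sample mean of the surviving samples is biased, and when the set is known the classical inverse-Mills-ratio formula predicts exactly how much; with the set unknown, as in the paper, recovering the true parameters is the hard part.

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

# Draw from N(mu, sigma^2) and keep only samples in the truncation set S = (0, inf).
raw = rng.normal(mu, sigma, size=200_000)
survivors = raw[raw > 0]

# The naive sample mean is biased upward: conditioning on x > 0 shifts mass right.
naive_mean = survivors.mean()

def phi(z):  # standard normal pdf
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):  # standard normal cdf
    return 0.5 * (1 + erf(z / sqrt(2)))

# With the set known (truncation at a = 0), the classical correction is
#   E[x | x > a] = mu + sigma * phi(alpha) / (1 - Phi(alpha)),  alpha = (a - mu)/sigma.
alpha = (0.0 - mu) / sigma
predicted_mean = mu + sigma * phi(alpha) / (1 - Phi(alpha))

print(naive_mean, predicted_mean)  # both well above the true mean mu = 1.0
```

The empirical mean of the truncated sample matches the known-set prediction but overshoots the true mean $\mu$, which is the bias an estimator for truncated statistics must undo.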

[1] Francis Galton, et al. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data, 1898, Proceedings of the Royal Society of London.

[2] M. Ledoux. Semigroup proofs of the isoperimetric inequality in Euclidean and Gauss space, 1994.

[3] K. Pearson, et al. On the generalised probable error in multiple normal correlation, 1908.

[4] Ronen Eldan. A Polynomial Number of Random Points Does Not Determine the Volume of a Convex Body, 2011, Discrete & Computational Geometry.

[5] Narayanaswamy Balakrishnan, et al. The Art of Progressive Censoring, 2014.

[6] Santosh S. Vempala, et al. Agnostic Estimation of Mean and Covariance, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[7] Alan M. Frieze, et al. Learning linear transformations, 1996, Proceedings of the 37th Conference on Foundations of Computer Science.

[8] Navin Goyal, et al. Efficient Learning of Simplices, 2012, COLT.

[9] Navin Goyal, et al. Learning Convex Bodies is Hard, 2009, COLT.

[10] Rene F. Swarttouw, et al. Orthogonal polynomials, 2020, NIST Handbook of Mathematical Functions.

[11] Rocco A. Servedio, et al. Learning from satisfying assignments, 2015, SODA.

[12] Jerry Li, et al. Being Robust (in High Dimensions) Can Be Practical, 2017, ICML.

[13] M. C. Jaiswal, et al. Estimation of parameters of doubly truncated normal distribution from first four sample moments, 1966.

[14] Ryan O'Donnell, et al. Learning Geometric Concepts via Gaussian Surface Area, 2008 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS).

[15] Jerry Li, et al. Robustly Learning a Gaussian: Getting Optimal Error, Efficiently, 2017, SODA.

[16] Daniel M. Kane, et al. Robust Estimators in High Dimensions without the Computational Intractability, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[17] Alice Lee. Table of the Gaussian "tail" functions; when the "tail" is larger than the body, 1914.

[18] Rocco A. Servedio, et al. Agnostically learning halfspaces, 2005, 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).

[19] Luc Devroye, et al. Combinatorial Methods in Density Estimation, 2001, Springer Series in Statistics.

[20] T. Sanders, et al. Analysis of Boolean Functions, 2012, arXiv.

[21] François Denis. PAC Learning from Positive Statistical Queries, 1998, ALT.

[22] C. B. Morgan. Truncated and Censored Samples: Theory and Applications, 1993.

[23] F. Nazarov. On the Maximal Perimeter of a Convex Set in $\mathbb{R}^n$ with Respect to a Gaussian Measure, 2003.

[24] Keith Ball. The reverse isoperimetric problem for Gaussian measure, 1993, Discrete & Computational Geometry.

[25] Shai Ben-David, et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[26] Rémi Gilleron, et al. Learning from positive and unlabeled examples, 2000, Theoretical Computer Science.

[27] Christos Tzamos, et al. Efficient Statistics, in High Dimensions, from Truncated Samples, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[28] G. Pisier. Probabilistic methods in the geometry of Banach spaces, 1986.

[29] Daniel M. Kane. The Gaussian Surface Area and Noise Sensitivity of Degree-d Polynomial Threshold Functions, 2010 IEEE 25th Annual Conference on Computational Complexity.

[30] Constantinos Daskalakis, et al. Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians, 2013, COLT.

[31] Gregory Valiant, et al. Learning from untrusted data, 2016, STOC.

[32] A. Carbery, et al. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$, 2001.

[33] Helmut Schneider. Truncated and Censored Samples from Normal Populations, 1986.