Classification of sparse high-dimensional vectors

We study the problem of classifying d-dimensional vectors into two classes (one of which is 'pure noise') based on a training sample of size m. The distinctive feature of the problem is that the dimension d can be very large. We assume that the distribution of the population differs from that of the noise only by a shift, which is a sparse vector. For Gaussian noise, fixed sample size m, and dimension d tending to infinity, we obtain the sharp classification boundary, i.e. the necessary and sufficient conditions for successful classification to be possible. We propose classifiers attaining this boundary. We also extend the result to the case where the sample size m depends on d and satisfies the condition m ≍ d^γ, 0 ≤ γ < 1, and to the case of non-Gaussian noise satisfying the Cramér condition.
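For concreteness, below is a minimal numerical sketch of a thresholding-type classifier in a sparse mean-shift setup of the kind described above. It is illustrative only and is not the classifier proposed in the paper: the unit-variance Gaussian noise with identity covariance, the universal threshold sqrt(2 log d), and the ad hoc score cutoff are assumptions made for this sketch, not statements from the abstract.

```python
# Illustrative sketch (not the paper's estimator): a naive threshold-based
# classifier for a sparse mean-shift model. Assumptions of this sketch:
# unit-variance Gaussian noise, identity covariance, hard threshold
# sqrt(2*log(d)), and an ad hoc cutoff for the decision score.
import numpy as np

rng = np.random.default_rng(0)

d, m = 10_000, 25          # dimension and training-sample size (toy values)
k, a = 50, 2.5             # number and magnitude of nonzero shift components

# Sparse shift vector: k coordinates equal to a, the rest are zero.
mu = np.zeros(d)
mu[rng.choice(d, size=k, replace=False)] = a

# Training sample from the shifted class: X_i = mu + standard Gaussian noise.
X = mu + rng.standard_normal((m, d))

# Coordinate-wise statistics: standardized sample means.
t = np.sqrt(m) * X.mean(axis=0)

# Keep only coordinates whose statistic exceeds the universal threshold.
thr = np.sqrt(2 * np.log(d))
w = np.where(t > thr, t, 0.0)

def classify(z: np.ndarray) -> int:
    """Return 1 if z is attributed to the shifted class, 0 if to pure noise."""
    score = float(w @ z)
    # Ad hoc cutoff (roughly half the expected score under the shifted class);
    # the paper derives the sharp boundary, which this sketch does not attempt.
    cutoff = 0.5 * float(w @ w) / np.sqrt(m)
    return int(score > cutoff)

# Quick check on fresh observations from both classes.
z_signal = mu + rng.standard_normal(d)
z_noise = rng.standard_normal(d)
print(classify(z_signal), classify(z_noise))   # ideally prints "1 0"
```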
