Independent rule in classification of multivariate binary data

We study the performance of the independent rule in the classification of multivariate binary data. We cover both the case where the number of variables, $d$, is fixed and the case where it grows with the sample size, $n$. The latter includes $d = O(n^{\tau})$ for $\tau > 0$, which covers the "small sample, large dimension" setting, namely $d \gg n$ when $\tau > 1$. Park and Ghosh [J. Park, J.K. Ghosh, Persistence of plug-in rule in classification of high dimensional binary data, Journal of Statistical Planning and Inference 137 (2007) 3687-3707] studied the independent rule in terms of the consistency of the misclassification error rate, which is called persistence, under a growing number of dimensions, but they did not investigate the convergence rate. We present asymptotic results on the convergence rate under some structured parameter space and highlight that variable selection is necessary to improve the performance of the independent rule. We also extend the independent rule to correlated binary data, such as data following the Bahadur representation or the logit model, and emphasize that variable selection is needed there as well to improve its performance.
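
To make the object of study concrete, the following is a minimal Python sketch of an independent rule for binary data: each variable is treated as an independent Bernoulli within each class, and an observation is assigned to the class with the larger product likelihood. The function names, the additive smoothing constant `eps`, the equal class priors, and the screening criterion based on $|p_{1j} - p_{0j}|$ are all assumptions made for illustration; they are not taken from the article, which does not specify its selection procedure here.

```python
import numpy as np

def fit_independent_rule(X0, X1, eps=0.5):
    """Estimate per-variable success probabilities for each class.

    X0, X1: arrays of shape (n0, d) and (n1, d) with 0/1 entries.
    eps: additive smoothing (an assumption of this sketch) that keeps
         the estimates strictly between 0 and 1.
    """
    p0 = (X0.sum(axis=0) + eps) / (X0.shape[0] + 2 * eps)
    p1 = (X1.sum(axis=0) + eps) / (X1.shape[0] + 2 * eps)
    return p0, p1

def select_variables(p0, p1, k):
    """Keep the k variables with the largest |p1 - p0|.

    This screening criterion is hypothetical; it only illustrates
    the idea of variable selection before classification.
    """
    return np.argsort(-np.abs(p1 - p0), kind="stable")[:k]

def classify(X, p0, p1, keep=None):
    """Assign class 1 when the summed per-variable log-likelihood
    ratio is positive (equal class priors assumed)."""
    if keep is not None:
        X, p0, p1 = X[:, keep], p0[keep], p1[keep]
    llr = X @ np.log(p1 / p0) + (1 - X) @ np.log((1 - p1) / (1 - p0))
    return (llr > 0).astype(int)

# Tiny hand-made example: variable 0 separates the classes,
# variable 2 carries no information.
X0 = np.array([[0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0]])
X1 = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0]])
p0, p1 = fit_independent_rule(X0, X1)
keep = select_variables(p0, p1, k=2)
predictions = classify(np.array([[1, 1, 0], [0, 0, 1]]), p0, p1, keep)
```

Dropping variables whose estimated class probabilities barely differ removes terms that contribute mostly estimation noise to the log-likelihood ratio, which is the intuition behind the need for variable selection when $d$ is large relative to $n$.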
