Selection of Binary Variables and Classification by Boosting

We adopt boosting for the classification and selection of high-dimensional binary variables, a setting in which classical methods that rely on normality and a nonsingular sample dispersion matrix are inapplicable. Boosting seems particularly well suited to binary variables. We present three methods, two of which combine boosting with the relatively classical variable selection methods developed in Wilbur et al. (2002). Our primary interest is variable selection in classification, with a small misclassification error serving to validate the proposed variable selection method. Two of the new methods perform uniformly better than Wilbur et al. (2002) on one simulated and three real-life examples.
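Boosting with single-variable base learners pairs naturally with variable selection, because the ensemble can only ever use the variables its weak learners split on. The sketch below is a minimal, illustrative AdaBoost with one-variable decision stumps on 0/1 features; it is not the paper's exact procedure, and all function and variable names here are our own.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=20):
    """AdaBoost with one-variable decision stumps on binary features.

    X: (n, p) array of 0/1 features; y: (n,) array of labels in {-1, +1}.
    Returns a list of (feature_index, polarity, alpha) weak learners.
    """
    n, p = X.shape
    w = np.full(n, 1.0 / n)                      # example weights
    learners = []
    for _ in range(n_rounds):
        best = None
        # Exhaustive search over stumps h(x) = polarity * (2*x_j - 1)
        for j in range(p):
            for polarity in (1, -1):
                pred = polarity * (2 * X[:, j] - 1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = polarity * (2 * X[:, j] - 1)
        w = w * np.exp(-alpha * y * pred)        # up-weight misclassified points
        w = w / w.sum()
        learners.append((j, polarity, alpha))
    return learners

def predict(learners, X):
    """Sign of the weighted vote of all stumps."""
    score = np.zeros(X.shape[0])
    for j, polarity, alpha in learners:
        score += alpha * polarity * (2 * X[:, j] - 1)
    return np.sign(score)

def selected_variables(learners):
    """Crude selection rule: the variables the ensemble actually used."""
    return sorted({j for j, _, _ in learners})
```

As a usage example, on toy data where the label is determined by variable 0, `selected_variables` recovers that variable and `predict` classifies the training set correctly; the selection rule shown (variables used by any stump) is only one simple possibility.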

[1] J. Franklin et al. The elements of statistical learning: data mining, inference and prediction, 2005.

[2] J. Zhu et al. Boosting as a regularized path to a maximum margin classifier, J. Mach. Learn. Res., 2004.

[3] G. Lugosi et al. Complexity regularization via localized random penalties, 2004, math/0410091.

[4] Y. Mansour et al. Generalization bounds for averaged classifiers, 2004, math/0410092.

[5] L. Breiman. Population theory for boosting ensembles, 2003.

[6] G. Lugosi et al. On the Bayes-risk consistency of regularized boosting methods, 2003.

[7] W. Jiang. Process consistency for AdaBoost, 2003.

[8] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization, 2003.

[9] R. W. Doerge et al. Variable selection in high-dimensional multivariate binary data with application to the analysis of microbial community DNA fingerprints, Biometrics, 2002.

[10] S. Dudoit et al. Comparison of discrimination methods for the classification of tumors using gene expression data, 2002.

[11] J. Wilbur. Variable selection methodology for high-dimensional multivariate binary data with application to microbial community DNA fingerprint analysis, 2002.

[12] C. Nakatsu et al. Soil community analysis using DGGE of 16S rDNA polymerase chain reaction products, 2000.

[13] J. Friedman. Additive logistic regression: a statistical view of boosting, 2000.

[14] T. G. Dietterich et al. Pruning adaptive boosting, ICML, 1997.

[15] Y. Freund et al. Boosting the margin: a new explanation for the effectiveness of voting methods, ICML, 1997.

[16] Y. Freund et al. A decision-theoretic generalization of on-line learning and an application to boosting, EuroCOLT, 1997.

[17] S. S. Young et al. Resampling-based multiple testing: examples and methods for p-value adjustment, 1993.

[18] A. Genz. Numerical computation of multivariate normal probabilities, 1992.

[19] M. Piedmonte et al. A method for generating high-dimensional multivariate binary variates, 1991.

[20] D. Cox et al. Asymptotic techniques for use in statistics, 1989.

[21] C. W. Dunnett et al. The numerical evaluation of certain multivariate normal integrals, 1962.