Classification of High-Dimensional Data with Ensemble of Logistic Regression Models

A classification method is developed based on ensembles of logistic regression models, with each model fitted to a different set of predictors determined by a random partition of the feature space. The proposed method enables class prediction for a high-dimensional data set, which a single logistic regression model cannot handle because it requires the sample size to exceed the number of predictors. The method is applied to gene expression data from pediatric acute myeloid leukemia (AML) patients to predict, at the time of diagnosis, each patient's risk of treatment failure or relapse. Hence, specific prognostic biomarkers can be used to predict outcomes in pediatric AML and to formulate individual risk-adjusted treatment. Our study shows that the proposed method is comparable to other widely used models in generalized accuracy and achieves a significantly better balance between sensitivity and specificity. The proposed ensemble algorithm thus enables a standard classification model to be used for high-dimensional data.
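The idea of fitting one logistic regression model per block of a random feature partition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic data, the gradient-descent fit with a small ridge penalty, the averaging of predicted probabilities across base models, and the 0.5 decision threshold are all choices made here for illustration and are not specified in the abstract.

```python
import numpy as np

def _sigmoid(z):
    # Clip to avoid overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, n_iter=500, lr=0.1, l2=1e-3):
    """Fit a logistic regression by gradient descent (assumed fitting method).

    Returns (weights, intercept). The small l2 penalty is an assumption
    added for numerical stability on tiny samples.
    """
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        err = _sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b

def fit_ensemble(X, y, n_subsets, rng):
    """Randomly partition the features into disjoint blocks, one model per block."""
    perm = rng.permutation(X.shape[1])
    blocks = np.array_split(perm, n_subsets)
    return [(cols, *fit_logistic(X[:, cols], y)) for cols in blocks]

def predict_proba(models, X):
    """Average the base models' predicted probabilities (assumed combination rule)."""
    probs = [_sigmoid(X[:, cols] @ w + b) for cols, w, b in models]
    return np.mean(probs, axis=0)

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return float(tp / max(pos, 1)), float(tn / max(neg, 1))

# Synthetic example: 40 samples, 200 predictors (more predictors than samples).
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) > 0).astype(float)

# 10 blocks of 20 predictors each, so every base model has fewer predictors than samples.
models = fit_ensemble(X, y, n_subsets=10, rng=rng)
pred = (predict_proba(models, X) >= 0.5).astype(int)
sens, spec = sensitivity_specificity(y, pred)
print(f"training sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The number of blocks is chosen here so that each block holds fewer predictors than there are samples, which is the condition the abstract gives for fitting a logistic regression model. The printed figures are computed on the training data and are therefore optimistic; a real evaluation would use held-out samples or cross-validation.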
