Data classification with binary response through the Boosting algorithm and logistic regression

Abstract The task of classifying is natural to humans, but there are situations in which a person is not best suited to perform this function, creating the need for automatic classification methods. Traditional methods, such as logistic regression, are commonly used in this type of situation, but they often lack robustness and accuracy. These methods do not work well when the data set contains many covariables or when there is noise in the data, situations that are common in expert and intelligent systems. Given the importance and increasing complexity of problems of this type, there is a need for methods that provide greater accuracy and interpretability of the results. Among these methods is Boosting, which operates sequentially by applying a classification algorithm to reweighted versions of the training data set. It was recently shown that Boosting may also be viewed as a method for functional estimation. The purpose of the present study was to compare the logistic regression model estimated by maximum likelihood (LRMML) with the logistic regression model estimated using the Boosting algorithm, specifically the Binomial Boosting algorithm (LRMBB), and to select the model with the better fit and discrimination capacity for the presence (absence) of a given property (i.e., binary classification). As an illustration, the presence (absence) of coronary heart disease (CHD) was classified as a function of various biological variables collected from patients. The simulation results indicate that the LRMBB model is more appropriate than the LRMML model for fitting data sets with several covariables and noisy data. The following sections report lower values of the information criteria AIC and BIC for the LRMBB model, and the Hosmer–Lemeshow test exhibits no evidence of a bad fit for the LRMBB model.
The LRMBB model also presented higher AUC, sensitivity, specificity and accuracy, as well as lower false-positive and false-negative rates, giving it better discrimination power than the LRMML model. Based on these results, the logistic model adjusted via the Binomial Boosting algorithm (LRMBB) is better suited to describing binary-response problems, because it provides more accurate information regarding the problem considered.
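The comparison outlined above can be sketched in code. The following is a minimal, self-contained illustration, not the paper's actual implementation or data: it fits a logistic regression by maximum likelihood (Newton–Raphson) and a binomial boosting model with componentwise linear base learners (in the spirit of Bühlmann and Hothorn's glmboost), then compares their AUCs on simulated binary-response data. The function names, the simulated covariates, and the tuning values (step size, number of boosting iterations) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary-response data (a hypothetical stand-in for the CHD covariates).
n, p = 500, 4
X = rng.normal(size=(n, p))
true_logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(float)

def fit_logistic_mle(X, y, iters=25):
    """Logistic regression fitted by maximum likelihood via Newton-Raphson."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p_hat = 1 / (1 + np.exp(-Xb @ beta))
        W = p_hat * (1 - p_hat)                 # IRLS weights
        grad = Xb.T @ (y - p_hat)
        H = Xb.T @ (Xb * W[:, None])            # observed information
        beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)), grad)
    return beta

def predict_logistic(beta, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return 1 / (1 + np.exp(-Xb @ beta))

def fit_binomial_boosting(X, y, n_steps=300, nu=0.1):
    """Gradient boosting of the binomial log-likelihood with componentwise
    linear base learners (assumes centered covariates, so no intercept)."""
    n, p = X.shape
    F = np.zeros(n)          # additive predictor on the logit scale
    coefs = np.zeros(p)      # accumulated slope per covariate
    for _ in range(n_steps):
        prob = 1 / (1 + np.exp(-F))
        u = y - prob         # negative gradient of the binomial loss
        best_j, best_b, best_err = -1, 0.0, np.inf
        for j in range(p):   # fit each covariate to u by least squares
            x = X[:, j]
            b = (x @ u) / (x @ x)
            err = np.sum((u - b * x) ** 2)
            if err < best_err:
                best_j, best_b, best_err = j, b, err
        F += nu * best_b * X[:, best_j]         # small step along best fit
        coefs[best_j] += nu * best_b
    return coefs

def auc(y, score):
    """Area under the ROC curve via the Mann-Whitney rank formulation."""
    order = np.argsort(score)
    ranks = np.empty_like(score)
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

beta = fit_logistic_mle(X, y)
auc_mle = auc(y, predict_logistic(beta, X))
coefs = fit_binomial_boosting(X, y)
auc_boost = auc(y, 1 / (1 + np.exp(-(X @ coefs))))
print(f"AUC (MLE): {auc_mle:.3f}, AUC (boosting): {auc_boost:.3f}")
```

On this linear, low-noise toy problem both fits discriminate well; the paper's point is that the boosting fit degrades less when covariates multiply and noise is added.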
