BACKGROUND
Various methods can be applied to build predictive models for clinical data with a binary outcome variable. This research explores the process of constructing common predictive models, namely logistic regression (LR), decision tree (DT) and multilayer perceptron (MLP), and focuses on specific details of applying these methods: what preconditions should be satisfied, how to set the model parameters, how to screen variables and build accurate models quickly and efficiently, and how to assess generalization ability (that is, prediction performance) reliably by the Monte Carlo method when the sample size is small.
METHODS
A total of 274 patients (137 with type 2 diabetes mellitus and diabetic peripheral neuropathy, and 137 with type 2 diabetes mellitus without diabetic peripheral neuropathy) from the Metabolic Disease Hospital in Tianjin participated in the study. There were 30 variables, such as sex, age and glycosylated hemoglobin. Because of the small sample size, the classification and regression tree (CART) and the chi-squared automatic interaction detector tree (CHAID) were combined by means of 100 repetitions of 5-7 fold stratified cross-validation to build the DT. The MLP was constructed using the Schwarz Bayes Criterion to choose the number of hidden layers and hidden layer units, along with the Levenberg-Marquardt (L-M) optimization algorithm, weight decay and a preliminary training method. Subsequently, LR was fitted by the best subset method with the Akaike Information Criterion (AIC) to make the best use of the information and avoid overfitting. Finally, 10 to 100 repetitions of 3-10 fold stratified cross-validation were used to compare the generalization ability of DT, MLP and LR in terms of the area under the receiver operating characteristic (ROC) curve (AUC).
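The repeated stratified cross-validation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic, `centroid_scorer` is a hypothetical stand-in for DT, MLP or LR, all function names are invented for the example, and averaging the AUC over folds (rather than pooling predictions) is one possible choice that the abstract does not specify.

```python
import random
from statistics import mean

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case from each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping the class ratio roughly equal in each fold."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds

def centroid_scorer(train_X, train_y, test_X):
    """Hypothetical stand-in model: project each test row onto the difference of class means."""
    d = len(train_X[0])
    def centroid(cls):
        rows = [x for x, y in zip(train_X, train_y) if y == cls]
        return [mean(r[j] for r in rows) for j in range(d)]
    w = [a - b for a, b in zip(centroid(1), centroid(0))]
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in test_X]

def repeated_stratified_cv_auc(X, y, model, k=5, repeats=100, seed=0):
    """Mean held-out AUC over `repeats` independent reshuffles of k-fold stratified CV."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            scores = model([X[i] for i in train_idx],
                           [y[i] for i in train_idx],
                           [X[i] for i in test_idx])
            aucs.append(auc([y[i] for i in test_idx], scores))
    return mean(aucs)

# Synthetic, balanced data standing in for the 137 + 137 patients (NOT the study data).
data_rng = random.Random(42)
y = [1] * 137 + [0] * 137
X = [[data_rng.gauss(1.0 if label else 0.0, 1.0) for _ in range(3)] for label in y]

mean_auc = repeated_stratified_cv_auc(X, y, centroid_scorer, k=5, repeats=10)
print(f"mean AUC over 10 repetitions of 5-fold stratified CV: {mean_auc:.4f}")
```

Swapping `centroid_scorer` for any function with the same signature allows different models to be compared under identical resampling, which is the point of fixing the seed.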
RESULTS
The AUCs of DT, MLP and LR were 0.8863, 0.8536 and 0.8802, respectively. Because a larger AUC indicates higher diagnostic ability, MLP performed best, followed by LR and DT, under 10-100 repetitions of 2-10 fold stratified cross-validation in our study. The neural network model is a preferred option for these data. However, best subset multiple LR would be a better choice in view of efficiency and accuracy.
CONCLUSION
When dealing with data that have a small sample size, multiple independent variables and a dichotomous outcome variable, additional strategies and statistical techniques (such as the AIC, the L-M optimization algorithm and the best subset method) should be considered when building a prediction model, and available methods (such as cross-validation and the AUC) can be used for evaluation.
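For reference, the two information criteria named in this abstract are conventionally defined as follows. The symbols are assumptions introduced here, not taken from the study: $\hat{L}$ is the maximized likelihood of the fitted model, $k$ the number of estimated parameters and $n$ the sample size.

```latex
\mathrm{AIC} = -2\ln\hat{L} + 2k,
\qquad
\mathrm{SBC} = -2\ln\hat{L} + k\ln n
```

Both criteria reward fit and penalize model size, and the model with the smaller value is preferred. With $n = 274$, $\ln n \approx 5.61$, so SBC charges roughly 5.6 units per additional parameter against 2 for AIC and therefore tends to favor smaller models.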