The problem of bias in training data in regression problems in medical decision support

This paper describes a bias problem encountered in a machine learning approach to outcome prediction in anticoagulant drug therapy. The outcome to be predicted is a measure of the patient's clotting time; since this measure is continuous, the prediction task is a regression problem. Artificial neural networks (ANNs) are a powerful mechanism for learning to predict such outcomes from training data. However, experiments have shown that an ANN is biased towards the values that occur most commonly in the training data and is therefore less likely to be correct when predicting extreme values. This bias in training data for regression problems is analogous to the problem of minority classes in classification; the classification case, however, is well documented and an ongoing area of research. In this paper we consider stratified sampling and boosting as solutions to this bias problem and evaluate them on the outcome prediction problem and on two other datasets. Both approaches produce some improvement, with boosting showing the most promise.
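The abstract does not give implementation details for the stratified sampling approach, so the following is only an illustrative sketch of one common way to stratify a continuous target: bin the target values, then oversample the sparsely populated (extreme) bins so the learner no longer sees mostly mid-range values. The function name, bin count, and NumPy-array inputs are assumptions, not the authors' method.

```python
import numpy as np

def stratified_resample(X, y, n_bins=10, samples_per_bin=None, rng=None):
    """Resample a regression training set so the continuous target y is
    roughly uniformly represented across its range.

    Illustrative sketch only: bins y, then draws an equal number of
    examples from each non-empty bin (with replacement), so rare extreme
    values are not swamped by commonly occurring ones.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Bin the continuous target into n_bins equal-width intervals.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)

    # Collect the indices falling into each non-empty bin.
    per_bin = [np.where(bin_ids == b)[0] for b in range(n_bins)]
    per_bin = [idx for idx in per_bin if len(idx) > 0]

    # Default: oversample every bin up to the size of the largest bin.
    if samples_per_bin is None:
        samples_per_bin = max(len(idx) for idx in per_bin)

    chosen = np.concatenate([
        rng.choice(idx, size=samples_per_bin, replace=True)
        for idx in per_bin
    ])
    return X[chosen], y[chosen]
```

With `samples_per_bin` set to the largest bin size, the commonly occurring target values are left roughly as-is while the rare, extreme values are oversampled; downsampling the large bins instead is the obvious alternative when training cost matters more than data volume.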
