A Class Balanced Active Learning Scheme that Accounts for Minority Class Problems: Applications to Histopathology

Classifiers for detecting disease patterns in biomedical image data require manual annotations to serve as ground truth for training and evaluation, but these annotations are costly to obtain because of the complexity of the images and the expert medical knowledge required. An intelligent training strategy can maximize the efficiency of manual annotation. In this paper we present a novel class balanced active learning (CBAL) framework for training a classifier to detect cancerous regions on prostate histopathology. The active learning (AL) algorithm identifies the samples in a set of unlabeled data that will maximize classification accuracy; only these samples are annotated, reducing the cost of training. We also address the minority class problem, in which one class (in this case, cancer) is underrepresented. By using a query strategy that adds equal numbers of instances from both object classes (cancer and non-cancer) to the training set, each class is well represented, resulting in high classifier accuracy. Finally, we present a cost model of our CBAL strategy. We use the CBAL framework to train a classifier for finding cancer in images of prostate histopathology, and compare its accuracy against training strategies that use random learning (RL) and strategies that do not enforce equal proportions of instances from both classes. On a dataset of over 12,000 prostate image regions, we find that (1) the classifier trained with CBAL achieves the maximum possible accuracy (i.e., the accuracy obtained by using all available samples for training) with two orders of magnitude fewer samples, and (2) the predicted cost of CBAL agrees well with the empirically determined cost, which is not significantly higher than that of RL.
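The class-balanced query idea described above can be illustrated with a short sketch. This is not the authors' implementation: the function name `balanced_query`, the per-class quota `per_class`, the precomputed `uncertainty` scores, and the rule of discarding annotated samples from a class whose quota is already full are all assumptions made for illustration. The abstract states only that equal numbers of instances from both classes are added to the training set.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def balanced_query(
    uncertainty: Sequence[float],
    oracle: Callable[[int], int],
    per_class: int,
) -> Tuple[List[int], int]:
    """Select `per_class` samples of each class (0 and 1), most uncertain first.

    Hypothetical helper: returns the selected sample indices and the number
    of annotations that had to be requested from the expert.
    """
    # Rank unlabeled samples from most to least uncertain.
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    counts: Dict[int, int] = {0: 0, 1: 0}
    selected: List[int] = []
    annotations = 0
    for idx in ranked:
        # Stop as soon as both classes have reached their quota.
        if all(c >= per_class for c in counts.values()):
            break
        label = oracle(idx)  # expert annotation: the costly step
        annotations += 1
        if counts[label] < per_class:
            counts[label] += 1
            selected.append(idx)
        # Assumed rule: a sample from an already-full class is annotated
        # but not added, so the training set stays balanced.
    return selected, annotations


# Toy pool: 1 = cancer (minority class), 0 = non-cancer.
labels = [0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
selected, annotations = balanced_query(scores, lambda i: labels[i], per_class=2)
print(selected, annotations)
```

In this toy pool, six annotations are needed to obtain four balanced training samples, because two majority-class samples are annotated and then set aside. That gap between annotations requested and samples kept is the kind of overhead a cost model for a balanced strategy has to account for.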
