An active learning based classification strategy for the minority class problem: application to histopathology annotation

Background

Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. Generating training data for these classifiers is problematic, since only domain experts (e.g. pathologists) can correctly label ground truth data. Additionally, digital pathology datasets suffer from the "minority class problem", in which exemplars from the non-target class outnumber target class exemplars; this imbalance can bias the classifier and reduce accuracy. In this paper, we develop a training strategy combining active learning (AL) with class-balancing. AL identifies unlabeled samples that are "informative" (i.e. likely to increase classifier performance) for annotation, avoiding non-informative samples. This yields high accuracy with a smaller training set than random learning (RL). Previous AL methods have not explicitly accounted for the minority class problem in biomedical images. Pre-specifying a target class ratio mitigates the problem of training bias. Finally, we develop a mathematical model to predict the number of annotations (cost) required to achieve balanced training classes. In addition to predicting training cost, the model reveals the theoretical properties of AL in the context of the minority class problem.

Results

Using this class-balanced AL training strategy (CBAL), we build a classifier to distinguish cancer from non-cancer regions on digitized prostate histopathology. Our dataset consists of 12,000 image regions sampled from 100 biopsies (58 prostate cancer patients). We compare CBAL against: (1) unbalanced AL (UBAL), which uses AL but ignores class ratio; (2) class-balanced RL (CBRL), which uses RL with a specific class ratio; and (3) unbalanced RL (UBRL). The CBAL-trained classifier yields 2% greater accuracy and 3% higher area under the receiver operating characteristic curve (AUC) than alternatively-trained classifiers. Our cost model accurately predicts the number of annotations necessary to obtain balanced classes; the accuracy of this prediction is verified against empirically observed costs. Finally, we find that over-sampling the minority class yields a marginal improvement in classifier accuracy, but the improved performance comes at the expense of greater annotation cost.

Conclusions

We have combined AL with class balancing to yield a general training strategy applicable to most supervised classification problems in which the dataset is expensive to obtain and suffers from the minority class problem. An intelligent training strategy is a critical component of supervised classification; integrating AL with an intelligent choice of class ratios, together with a general cost model, will help researchers plan the training process more quickly and effectively.
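
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "informative" means a predicted target-class probability close to 0.5, and that an annotated sample is discarded once its class quota is full. The function name `class_balanced_selection` and its parameters are hypothetical.

```python
def class_balanced_selection(pool_probs, oracle_labels, quota_per_class):
    """Pick informative samples until every class quota is filled.

    pool_probs      -- predicted target-class probability for each pool sample
                       (assumed to come from some current classifier)
    oracle_labels   -- true labels; reading one models a single expert annotation
    quota_per_class -- dict mapping label -> number of samples wanted
    Returns (indices kept for training, total number of annotations).
    """
    # Assumption: ambiguity is the distance of the probability from 0.5.
    order = sorted(range(len(pool_probs)),
                   key=lambda i: abs(pool_probs[i] - 0.5))
    remaining = dict(quota_per_class)
    selected, cost = [], 0
    for i in order:
        if all(q == 0 for q in remaining.values()):
            break                      # target class ratio reached
        label = oracle_labels[i]       # expert annotates the sample
        cost += 1                      # every annotation is paid for
        if remaining.get(label, 0) > 0:
            selected.append(i)         # kept: its class still needs samples
            remaining[label] -= 1
        # otherwise the sample is annotated but not added to the training set

    return selected, cost


# Toy pool: two target-class (1) samples among seven.
probs = [0.80, 0.50, 0.30, 0.52, 0.05, 0.45, 0.60]
labels = [1, 0, 0, 0, 0, 1, 0]
selected, cost = class_balanced_selection(probs, labels, {0: 2, 1: 2})
print(selected, cost)  # [1, 3, 5, 0] 6
```

In this toy run, six annotations are needed to keep four samples, because two non-target samples arrive after their quota is already full. Under the simplifying assumption that each queried sample independently belongs to the minority class with probability p, collecting k minority samples takes about k/p annotations on average; this is a textbook estimate offered for intuition, not the cost model developed in the paper.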
