Analysis of sampling techniques for imbalanced data: An n=648 ADNI study

Many neuroimaging applications deal with imbalanced imaging data. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly twice as numerous as the Alzheimer's disease (AD) patients for the structural magnetic resonance imaging (MRI) modality, and six times as numerous as the control cases for the proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task: traditional classifiers that aim to maximize overall prediction accuracy tend to assign all data to the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and combinations of over- and undersampling. We examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two classifiers, Random Forest and Support Vector Machines, based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Our experimental results show that, for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids-based undersampling gives the best overall performance, compared with the other data sampling techniques and with no sampling; and (2) sparse logistic regression with stability selection achieves competitive performance among the feature selection algorithms. Experiments across these settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results.
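The following is a minimal sketch, not the authors' implementation, of two ideas in the abstract: why overall accuracy is misleading on imbalanced data, and how an ensemble over multiple balanced, undersampled training sets can be assembled. Everything in it is an assumption made for illustration: the data are synthetic (60 majority and 10 minority samples, echoing the roughly six-to-one ratio mentioned above), plain random undersampling stands in for the K-Medoids-based undersampling, and a nearest-centroid rule stands in for Random Forest and Support Vector Machines. All function names are hypothetical.

```python
import random


def metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def balanced_subsets(X, y, n_subsets, rng, minority=1):
    """Random undersampling (assumption: a stand-in for K-Medoids).

    Each subset keeps every minority sample plus an equally sized
    random draw, without replacement, from the majority class.
    """
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    subsets = []
    for _ in range(n_subsets):
        chosen = min_idx + rng.sample(maj_idx, len(min_idx))
        subsets.append(([X[i] for i in chosen], [y[i] for i in chosen]))
    return subsets


def fit_centroids(X, y):
    """Nearest-centroid model (assumption: a stand-in for RF / SVM)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def ensemble_predict(models, X):
    """Majority vote over the models trained on the undersampled subsets."""
    preds = []
    for x in X:
        votes = [predict_one(m, x) for m in models]
        preds.append(max(set(votes), key=votes.count))
    return preds


# Synthetic two-feature data: 60 majority (label 0), 10 minority (label 1).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(10)]
y = [0] * 60 + [1] * 10

# A classifier that always predicts the majority class looks accurate
# (60/70) but never detects a minority case (sensitivity 0).
baseline = metrics(y, [0] * len(y))

# Ensemble of five models, each trained on one balanced subset.
# An odd number of models avoids ties in the binary vote.
subsets = balanced_subsets(X, y, n_subsets=5, rng=rng)
models = [fit_centroids(Xs, ys) for Xs, ys in subsets]

# Evaluated on the training data for brevity only; a real study
# would report these measures on held-out folds.
ensemble = metrics(y, ensemble_predict(models, X))

print("majority-class baseline:", baseline)
print("undersampled ensemble:  ", ensemble)
```

Reporting sensitivity and specificity next to accuracy, as the abstract does, exposes the failure of the majority-class baseline that accuracy alone would hide.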
