Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data

Background Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification approach based on distinct feature selection and modeling strategies. The accuracy of these methods in biomarker selection and outlier detection needs to be evaluated and compared so that the appropriate method can be chosen.
Results The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On the simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates < 0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers it missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of the data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
Conclusions When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is >5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
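
The abstract reports sensitivity and false positive rate for outlier detection, and positive selection rate and false discovery rate for variable selection. The following is a minimal Python sketch of how such quantities are commonly computed on simulated data where the truth is known. The function names and the exact definitions used here (positive selection rate as the fraction of truly informative variables that were selected, false discovery rate as the fraction of selected variables that are not informative) are assumptions for illustration and may differ from the definitions used in the paper.

```python
import math


def detection_metrics(true_flags, predicted_flags):
    """Sensitivity and false positive rate for outlier detection.

    true_flags[i] is True if sample i is truly mislabeled;
    predicted_flags[i] is True if the method flagged sample i.
    (Hypothetical helper, not taken from the paper.)
    """
    pairs = list(zip(true_flags, predicted_flags))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    false_positive_rate = fp / (fp + tn) if (fp + tn) else math.nan
    return sensitivity, false_positive_rate


def selection_metrics(true_vars, selected_vars):
    """Positive selection rate and false discovery rate for variable selection.

    Assumed definitions: PSR = |true and selected| / |true|,
    FDR = |selected but not true| / |selected| (0.0 if nothing is selected).
    """
    true_vars, selected_vars = set(true_vars), set(selected_vars)
    hits = len(true_vars & selected_vars)
    psr = hits / len(true_vars) if true_vars else math.nan
    fdr = (len(selected_vars) - hits) / len(selected_vars) if selected_vars else 0.0
    return psr, fdr


if __name__ == "__main__":
    # Toy data: 2 truly mislabeled samples out of 6; the method flags 2 samples.
    truth = [True, True, False, False, False, False]
    flagged = [True, False, True, False, False, False]
    print(detection_metrics(truth, flagged))

    # Toy data: 4 informative variables; the method selects 3 variables.
    print(selection_metrics({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g9"}))
```

In a simulation study these functions would be applied to each replicate and averaged over replicates for each outlier proportion.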
