Classification of mislabelled microarrays using robust sparse logistic regression

MOTIVATION: Previous studies have reported that labelling errors are not uncommon in microarray datasets. In such cases the training set can become misleading, and the ability of classifiers to draw reliable inferences from the data is compromised. Yet few methods are currently available in the bioinformatics literature to deal with this problem. The existing methods focus on data cleansing alone, without reference to classification, and their performance depends crucially on tuning parameters.

RESULTS: In this article, we develop a new method that detects mislabelled arrays while simultaneously learning a sparse logistic regression classifier. Our method may be seen as a label-noise robust extension of the well-known and successful Bayesian logistic regression classifier. To account for possible mislabelling, we formulate a label-flipping process as part of the classifier. The regularization parameter is set automatically using Bayesian regularization, which not only saves the computation time that cross-validation would require, but also avoids any unwanted influence of label noise on the choice of this parameter. Extensive experiments with both synthetic data and real microarray datasets demonstrate that our approach counters the adverse effects of labelling errors on predictive performance, is effective at identifying marker genes and, at the same time, detects mislabelled arrays with high accuracy.

AVAILABILITY: The code is available from http://cs.bham.ac.uk/~jxb008.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
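The label-flipping formulation described above can be made concrete with a small sketch. The fragment below, written with NumPy/SciPy, illustrates the general idea of label-noise robust logistic regression rather than the authors' implementation: the observed label is modelled as the true label passed through a flipping process with rates gamma01 and gamma10, and the weights are fitted by minimizing the resulting noisy-label negative log-likelihood. The fixed flip rates, the simple Gaussian (ridge) penalty alpha and all function names are assumptions made for this example; the method in the paper learns the flip rates, uses a sparse prior and sets the regularization parameter by Bayesian regularization rather than by hand.

```python
# Minimal sketch of label-noise robust logistic regression (illustration only).
# gamma01 = P(observed label 1 | true label 0), gamma10 = P(observed 0 | true 1);
# both are fixed here, whereas the paper estimates them from the data.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid


def noisy_label_probability(w, X, gamma01, gamma10):
    """P(observed label = 1 | x) under the label-flipping model."""
    p_true = expit(X @ w)                        # P(true label = 1 | x)
    return (1.0 - gamma10) * p_true + gamma01 * (1.0 - p_true)


def negative_log_likelihood(w, X, y, gamma01, gamma10, alpha):
    """Noisy-label logistic loss plus a Gaussian (ridge) penalty on w."""
    p = noisy_label_probability(w, X, gamma01, gamma10)
    eps = 1e-12                                  # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + 0.5 * alpha * np.dot(w, w)


def fit_robust_logreg(X, y, gamma01=0.05, gamma10=0.05, alpha=1.0):
    """Fit the weights by minimizing the noisy-label negative log-likelihood."""
    w0 = np.zeros(X.shape[1])
    result = minimize(negative_log_likelihood, w0,
                      args=(X, y, gamma01, gamma10, alpha),
                      method="L-BFGS-B")
    return result.x


def flip_posterior(w, X, y, gamma01, gamma10):
    """Posterior probability that each observed label was flipped."""
    p_true = expit(X @ w)
    p_obs1 = (1.0 - gamma10) * p_true + gamma01 * (1.0 - p_true)
    # Bayes' rule over the two possible true labels, split by the observed label
    flipped_if_obs1 = gamma01 * (1.0 - p_true) / p_obs1
    flipped_if_obs0 = gamma10 * p_true / (1.0 - p_obs1)
    return np.where(y == 1, flipped_if_obs1, flipped_if_obs0)
```

A useful by-product of such a model is a per-array posterior probability of having been mislabelled (the flip_posterior helper above), which is what allows mislabelled-array detection and classifier learning to proceed simultaneously.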
