Detecting potential labeling errors in microarrays by data perturbation

MOTIVATION Classification is widely used in medical applications. However, the quality of the classifier depends critically on the accurate labeling of the training data. But for many medical applications, labeling a sample or grading a biopsy can be subjective. Existing studies confirm this phenomenon and show that even a very small number of mislabeled samples could deeply degrade the performance of the obtained classifier, particularly when the sample size is small. The problem we address in this paper is to develop a method for automatically detecting samples that are possibly mislabeled. RESULTS We propose two algorithms, a classification-stability algorithm and a leave-one-out-error-sensitivity algorithm for detecting possibly mislabeled samples. For both algorithms, the key structure is the computation of the leave-one-out perturbation matrix. The classification-stability algorithm is based on measuring the stability of the label of a sample with respect to label changes of other samples and the version of this algorithm based on the support vector machine appears to be quite accurate for three real datasets. The suspect list produced by the version is of high quality. Furthermore, when human intervention is not available, the correction heuristic appears to be beneficial.

[1]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[2]  Enrico Blanzieri,et al.  Assessment of SVM Reliability for Microarray Data Analysis , 2004 .

[3]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[4]  Roberto Alejo,et al.  Analysis of new techniques to obtain quality training sets , 2003, Pattern Recognit. Lett..

[5]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[6]  E. Lander,et al.  A molecular signature of metastasis in primary solid tumors , 2003, Nature Genetics.

[7]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[8]  Carla E. Brodley,et al.  Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[9]  A. Levine,et al.  Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. , 2001, Combinatorial chemistry & high throughput screening.

[10]  Michael J. Black,et al.  Robust principal component analysis for computer vision , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[11]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[12]  K. Kadota,et al.  Detecting outlying samples in microarray data: A critical assessment of the effect of outliers on sample classification , 2003 .

[13]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Fabrice Muhlenbach,et al.  Identifying and Handling Mislabelled Instances , 2004, Journal of Intelligent Information Systems.

[15]  Dimitris N. Metaxas,et al.  Distinguishing mislabeled data from correctly labeled data in classifier design , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[16]  Michael J. Black,et al.  Robust Principal Component Analysis for Computer Vision , 2001, ICCV.

[17]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[19]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[20]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[21]  M. Jabri,et al.  Robust principal component analysis , 2000, Neural Networks for Signal Processing X. Proceedings of the 2000 IEEE Signal Processing Society Workshop (Cat. No.00TH8501).

[22]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[23]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.