Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model

MOTIVATION Mislabeled samples often appear in gene expression profile because of the similarity of different sub-type of disease and the subjective misdiagnosis. The mislabeled samples deteriorate supervised learning procedures. The LOOE-sensitivity algorithm is an approach for mislabeled sample detection for microarray based on data perturbation. However, the failure of measuring the perturbing effect makes the LOOE-sensitivity algorithm a poor performance. The purpose of this article is to design a novel detection method for mislabeled samples of microarray, which could take advantage of the measuring effect of data perturbations. RESULTS To measure the effect of data perturbation, we define an index named perturbing influence value (PIV), based on the support vector machine (SVM) regression model. The Column Algorithm (CAPIV), Row Algorithm (RAPIV) and progressive Row Algorithm (PRAPIV) based on the PIV value are proposed to detect the mislabeled samples. Experimental results obtained by using six artificial datasets and five microarray datasets demonstrate that all proposed methods in this article are superior to LOOE-sensitivity. Moreover, compared with the simple SVM and CL-stability, the PRAPIV algorithm shows an increase in precision and high recall. AVAILABILITY The program and source code (in JAVA) are publicly available at http://ccst.jlu.edu.cn/CSBG/PIVS/index.htm

[1]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[2]  Roberto Alejo,et al.  Analysis of new techniques to obtain quality training sets , 2003, Pattern Recognit. Lett..

[3]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[4]  Eric S. Lander,et al.  Genomic analysis of metastasis reveals an essential role for RhoC , 2000, Nature.

[5]  References , 1971 .

[6]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[7]  J. Welsh,et al.  Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. , 2001, Cancer research.

[8]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[9]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[10]  Tak-Hong Cheung,et al.  Expression genomics of cervical cancer: molecular classification and prediction of radiotherapy response by DNA microarray. , 2003, Clinical cancer research : an official journal of the American Association for Cancer Research.

[11]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[12]  T. H. Bø,et al.  New feature subset selection procedures for classification of expression profiles , 2002, Genome Biology.

[13]  Fabrice Muhlenbach,et al.  Identifying and Handling Mislabelled Instances , 2004, Journal of Intelligent Information Systems.

[14]  Dimitris N. Metaxas,et al.  Distinguishing mislabeled data from correctly labeled data in classifier design , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[15]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[16]  Igor V. Tetko,et al.  Optimization models for cancer classification: extracting gene interaction information from microarray expression data , 2004, Bioinform..

[17]  Carla E. Brodley,et al.  Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[18]  Wensheng Zhang,et al.  A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer , 2006, Bioinform..

[19]  Roland Eils,et al.  Prediction of clinical outcome and biological characterization of neuroblastoma by expression profiling , 2004, Oncogene.

[20]  Enrico Blanzieri,et al.  Detecting potential labeling errors in microarrays by data perturbation , 2006, Bioinform..

[21]  K. Kadota,et al.  Detecting outlying samples in microarray data: A critical assessment of the effect of outliers on sample classification , 2003 .