An Algorithm for Recognizing Mislabeled and Abnormal Samples in Cancer Microarray

Microarrays are a high-throughput experimental technology used in many areas of the life sciences, especially in medical applications. Sample classification is crucial for disease diagnosis and treatment. However, sample labeling can be very complex and partially subjective. Existing studies confirm this and show that even a very small number of erroneous samples can severely degrade the performance of the resulting classifier, particularly when the dataset is small. Organizations and companies are collecting more and more microarray data that can be used for further investigation, but detecting and correcting mislabeled samples by hand remains difficult. This paper addresses the problem of automatically detecting mislabeled samples and correcting suspect samples. We propose Iterative-CLSWE, an algorithm for detecting and correcting potentially erroneous samples that is based on the classification stability of each sample across the whole dataset. Experimental results validate the proposed algorithm. Such automatic detection of mislabeled and abnormal samples may prove significant for large collections of data drawn from heterogeneous studies.
