Evolutionary computation with noise perturbation and cluster analysis to discover biomarker sets

Abstract In biomedical science, data mining techniques have been applied to extract statistically significant and clinically useful information from a given dataset. Finding biomarker gene sets for diseases can aid in understanding disease diagnosis, prognosis and therapy response. Gene expression microarrays have played an important role in such studies, yet their analysis has also drawn criticism. Analysis of these datasets carries a high risk of over-fitting (discovering spurious patterns) because of their feature-rich but case-poor nature. This paper describes a GA-SVM hybrid with Gaussian noise perturbation (with a manual noise gain) to combat over-fitting, determine the strongest signal in the dataset, and discover stable biomarker sets. A colon cancer gene expression microarray dataset is used to show that the strongest signal in the data (the optimal noise gain, at which a modest number of similar candidates emerge) can be found by a binary search. The diversity of candidates (measured by cluster analysis) is reduced by the noise perturbation, indicating that some of the patterns are being eliminated (we hope mostly spurious ones). Initial biological validation has been performed, and genes have different levels of significance to the candidates, although the discovered biomarker sets should be studied further to ascertain their biological significance and clinical utility. Furthermore, statistical validation shows that the strongest signal in the data is spurious and that the discovered biomarker sets should be rejected.
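The two mechanical ideas in the abstract, Gaussian noise perturbation scaled by a noise gain and a binary search over that gain, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the paper's implementation: `perturb` scales the noise by each feature's standard deviation, `search_gain` assumes the number of surviving candidates does not increase as the gain grows, and `toy_count` is a hypothetical stand-in for a full GA-SVM run that reports how many candidate biomarker sets emerge at a given gain.

```python
import random
import statistics


def perturb(rows, gain, rng):
    """Add zero-mean Gaussian noise to every value.

    Assumption: the noise standard deviation for a feature is
    gain * (population standard deviation of that feature).
    """
    columns = list(zip(*rows))
    sds = [statistics.pstdev(col) for col in columns]
    return [[x + rng.gauss(0.0, gain * sd) for x, sd in zip(row, sds)]
            for row in rows]


def search_gain(count_candidates, lo, hi, target_max, tol=1e-3):
    """Binary search for the smallest gain leaving at most target_max candidates.

    Assumption: count_candidates(gain) is non-increasing in gain and
    count_candidates(hi) <= target_max at the start.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count_candidates(mid) <= target_max:
            hi = mid   # few enough candidates: try a smaller gain
        else:
            lo = mid   # too many candidates: more noise is needed
    return hi


def toy_count(gain):
    """Hypothetical stand-in for a GA-SVM run: candidates thin out as noise grows."""
    return max(0, int(100 * (1 - gain)))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
    noisy = perturb(data, gain=0.5, rng=rng)
    best = search_gain(toy_count, lo=0.0, hi=1.0, target_max=10)
    print(len(noisy), round(best, 2))
```

In a real study, `toy_count` would be replaced by the expensive step of running the GA-SVM hybrid on perturbed data and counting the similar candidates that emerge, so the binary search mainly serves to limit how many such runs are needed.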
