Data Mining Based on Colon Cancer Gene Expression Profiles

This study draws on bioinformatics. To distinguish colon cancer samples from normal samples and to select informative genes for classification, pattern recognition and data mining methods were applied to colon cancer gene expression data. First, the signal-to-noise ratio (SNR) and the Bhattacharyya distance (BHA) were each used to remove irrelevant genes and noise while guarding against the mistaken deletion of informative genes; each criterion yielded 100 informative genes. Second, the union of the two 100-gene lists (200 entries in total), denoted union C, was computed, leaving 102 distinct informative genes. Third, the minimum redundancy maximum relevance (MRMR) method was used to search union C for an informative gene subset. Finally, a support vector machine (SVM) was used as the classifier to separate normal samples from colon cancer samples, and 12 informative genes were extracted based on the average classification rate. Results over several random samplings showed that the 12 informative genes extracted in this study classify normal and colon cancer samples with an accuracy of 93.70%.
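The filtering and union steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the scoring formulas, so the SNR form |μ1 − μ2| / (σ1 + σ2) and the univariate Gaussian form of the Bhattacharyya distance are assumptions, and the names `snr`, `bhattacharyya`, `top_k`, `candidate_union`, and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def snr(a, b):
    # Assumed SNR form: |mean difference| / (sum of class standard deviations).
    # Requires at least one class to have non-zero spread.
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b))

def bhattacharyya(a, b):
    # Assumed form: Bhattacharyya distance between two univariate Gaussians.
    # Requires non-zero variance in both classes.
    va, vb = pstdev(a) ** 2, pstdev(b) ** 2
    mean_term = 0.25 * (mean(a) - mean(b)) ** 2 / (va + vb)
    var_term = 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb)))
    return mean_term + var_term

def top_k(expr, labels, score, k):
    # expr maps gene name -> expression values, one per sample;
    # labels holds 1 for tumor and 0 for normal, in the same sample order.
    scores = {}
    for gene, values in expr.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = score(tumor, normal)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def candidate_union(expr, labels, k):
    # Union of the top-k genes under each criterion (the "union C" step).
    return top_k(expr, labels, snr, k) | top_k(expr, labels, bhattacharyya, k)

# Toy data (hypothetical): three genes, three tumor and three normal samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],  # large mean shift
    "g2": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],    # no difference between classes
    "g3": [1.0, 5.0, 9.0, 4.0, 5.0, 6.0],    # same mean, different variance
}
union_c = candidate_union(expr, labels, k=2)
```

In the toy data, `g3` has identical class means, so the SNR gives it a score of zero, while the variance term of the Bhattacharyya distance still ranks it above `g2`. This is one plausible reason for combining the two criteria, since the union keeps genes that either filter alone would discard. In the study itself, k would be 100 and the union would be passed on to the MRMR and SVM stages.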
