Using Most Similarity Tree Based Clustering to Select the Top Most Discriminating Genes for Cancer Detection

The development of DNA array technology makes cancer detection from DNA array expression data feasible. However, such research is usually plagued by the “curse of dimensionality”, and discriminative power is seriously weakened by the noise and redundancy that are abundant in these datasets. This paper proposes a hybrid gene selection method for cancer detection based on clustering of most similarity tree (CMST). The method yields a number of non-redundant clusters and the most discriminating gene from each cluster. These discriminating genes are then used to train a perceptron, which produces a very efficient classification. In CMST, the Gap statistic is used to determine the optimal similarity measure λ and the number of clusters. A gene selection method with optimal self-adaptive CMST (OS-CMST) for cancer detection is also presented. Experiments show that gene pattern pre-processing based on CMST not only significantly reduces the dimensionality of the attributes but also effectively improves the classification rate in cancer detection, and that the selection scheme based on OS-CMST can acquire the top most discriminating genes.
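The selection step described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the most similarity tree is taken to be a maximum spanning tree over pairwise Pearson correlations between gene profiles, clusters are the components left after cutting tree edges whose similarity is below λ, and the "most discriminating" gene is the one with the largest signal-to-noise score between the two classes. The function names, the toy data, and the hand-picked λ = 0.9 are hypothetical; the Gap statistic search for λ and the perceptron training are omitted.

```python
import math

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def cmst_clusters(profiles, lam):
    """Grow a maximum spanning tree (Kruskal) and stop at similarity < lam.

    Assumption: this is equivalent to building the full tree and then
    cutting every edge whose similarity is below lam.
    """
    n = len(profiles)
    edges = sorted(
        ((pearson(profiles[i], profiles[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sim, i, j in edges:
        if sim < lam:
            break  # all remaining edges are weaker than the cut level
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    clusters = {}
    for g in range(n):
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())

def discrimination_score(values, labels):
    """Signal-to-noise score between class 1 and class 0 (an assumed criterion)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]

    def mean(xs):
        return sum(xs) / len(xs)

    def std(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    return abs(mean(pos) - mean(neg)) / (std(pos) + std(neg) + 1e-9)

def select_genes(profiles, labels, lam):
    """Return one gene index per cluster: the highest-scoring gene."""
    return sorted(
        max(c, key=lambda g: discrimination_score(profiles[g], labels))
        for c in cmst_clusters(profiles, lam)
    )

# Toy data: rows are genes, columns are samples (3 normal, 3 tumour).
labels = [0, 0, 0, 1, 1, 1]
profiles = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # gene 0: strongly class-dependent
    [1.1, 1.3, 1.0, 2.0, 2.2, 1.9],  # gene 1: redundant with gene 0
    [5.0, 1.0, 3.0, 2.0, 4.0, 6.0],  # gene 2: unrelated to the class
    [5.1, 1.2, 2.9, 2.1, 4.2, 5.8],  # gene 3: redundant with gene 2
]
clusters = cmst_clusters(profiles, lam=0.9)
selected = select_genes(profiles, labels, lam=0.9)
```

On this toy input the four genes fall into two clusters of redundant profiles, and one representative per cluster is kept, so the feature set passed to a classifier shrinks from four genes to two.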
