Gene Selection Based on Mutual Information for the Classification of Multi-class Cancer

With the development of microarray technology, microarray data are widely used in the diagnosis of cancer subtypes. However, accurate diagnosis of cancer subtypes remains a difficult problem. Building classifiers on key genes selected from microarray data is a promising approach for the development of microarray technology, yet selecting genes that are relevant but non-redundant is complicated. The selected gene set should be small enough to allow diagnosis even in regular laboratories, and ideally it should identify genes involved in cancer-specific regulatory pathways. Unlike traditional gene selection methods, which are designed for the classification of two categories of cancer, this paper proposes a novel gene selection algorithm based on mutual information for multi-class cancer classification using microarray data. In our algorithm, mutual information is employed to select the key genes related to the class distinction, and the selected genes are then fed into a classifier to classify the cancer subtypes. Application to breast cancer data suggests that the algorithm can identify the key genes for the distinction between the BRCA1-mutation, BRCA2-mutation, and sporadic classes, and that it classifies these three types of breast cancer effectively and efficiently. Two further microarray datasets, leukemia and ovarian cancer, are also used to validate the performance of our method. The results on these applications demonstrate the high quality of our method. Based on the present work, our method can be widely used to discriminate between different cancer subtypes, which will contribute to the development of technology for cancer treatment.
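As a rough illustration of mutual-information-based gene ranking for more than two classes, the sketch below scores each gene by the mutual information I(G; C) between its discretized expression level G and the class label C, then keeps the highest-scoring genes. This is not the algorithm proposed in the paper, whose details are not given in this abstract. The equal-width binning, the `n_bins` and `top_k` parameters, and all function names are assumptions made for illustration, and the sketch ranks genes by relevance only, without the redundancy filtering the abstract mentions.

```python
import numpy as np


def mutual_information(x, y):
    """Estimate I(X; Y) in bits from two equal-length arrays of discrete labels."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()

    # Joint frequency table, normalized to a joint probability distribution.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()

    # Marginals; their outer product is the joint under independence.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    independent = px @ py

    nonzero = joint > 0  # skip empty cells, where p * log(p) is taken as 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / independent[nonzero])))


def discretize(values, n_bins=3):
    """Equal-width binning (an assumption; other discretizations are possible)."""
    values = np.asarray(values, dtype=float)
    inner_edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]
    return np.digitize(values, inner_edges)


def rank_genes_by_mi(expression, labels, n_bins=3, top_k=10):
    """Return (gene indices, MI scores) for the top_k most informative genes.

    expression: array of shape (n_samples, n_genes)
    labels:     array of shape (n_samples,) with two or more classes
    """
    expression = np.asarray(expression, dtype=float)
    scores = np.array([
        mutual_information(discretize(expression[:, g], n_bins), labels)
        for g in range(expression.shape[1])
    ])
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]


if __name__ == "__main__":
    # Toy data: 6 samples, 3 classes, 3 hypothetical genes.
    labels = np.array([0, 0, 1, 1, 2, 2])
    expression = np.array([
        [0.0, 1.0, 0.0],
        [0.0, 1.0, 10.0],
        [5.0, 1.0, 0.0],
        [5.0, 1.0, 10.0],
        [10.0, 1.0, 0.0],
        [10.0, 1.0, 10.0],
    ])
    top_genes, top_scores = rank_genes_by_mi(expression, labels, top_k=2)
    print(top_genes, top_scores)
```

In the toy data, the first gene separates the three classes exactly, while the other two carry no class information, so the first gene is ranked highest. The selected columns could then be passed to any multi-class classifier.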
