Gene ontology based quantitative index to select functionally diverse genes

Among the many gene selection algorithms available in the literature, the rough set based maximum relevance-maximum significance (RSMRMS) algorithm has been shown to be successful in selecting a set of relevant and significant genes from microarray data. However, analyzing the functional diversity of a gene set is essential both to understand the role of genes in a particular disease and to evaluate the effectiveness of a gene selection algorithm. In this regard, a gene ontology based quantitative index, termed the degree of functional diversity (DoFD), is proposed to quantify the functional diversity of a set of genes selected by any gene selection algorithm. Moreover, a new gene selection algorithm is presented that judiciously integrates the merits of both DoFD and RSMRMS to select relevant and significant genes that are also functionally diverse. The performance of the proposed gene selection algorithm is compared with that of other gene selection methods, using the proposed DoFD and the predictive accuracy of the K-nearest neighbor rule and the support vector machine on six cancer microarray data sets and one arthritis microarray data set. An important finding is that the proposed gene ontology based quantitative index can accurately evaluate the functional diversity of a set of genes. The proposed gene selection algorithm is also shown to be effective for selecting relevant, significant, and functionally diverse genes from microarray data.
