Discriminant functional gene groups identification with machine learning and prior knowledge

In computational biology, the analysis of high-throughput data raises several issues concerning the reliability, reproducibility and interpretability of the results. Gene lists selected in different studies of the same disease are often inconsistent, and it has been suggested that one reason is that in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are associated with the outcomes. Thus, a possible approach to improving the interpretability of gene lists is to integrate biological information from genomic databases into the learning process. Here we propose KDVS, a machine learning based pipeline that incorporates domain biological knowledge a priori to structure the data matrix before the feature selection and classification phases. The pipeline is completed by a final step of semantic clustering and visualization. The clustering phase makes the results easier to interpret by allowing the identification of their biological meaning. To demonstrate the efficacy of this procedure, we analyzed a public dataset on prostate cancer.
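The idea of structuring the data matrix with prior knowledge before classification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy expression values, the gene labels, and the placeholder term identifiers are hypothetical, and the leave-one-out nearest-centroid classifier is only a simple stand-in for the feature selection and classification methods actually used in KDVS.

```python
# Illustrative sketch only: names, data and classifier are hypothetical,
# not the KDVS implementation.

def split_by_annotation(matrix, genes, annotations):
    """Split a samples-by-genes matrix into one submatrix per annotation term.

    annotations maps a term identifier to the set of genes annotated with it.
    Terms with no measured gene are dropped.
    """
    column = {g: j for j, g in enumerate(genes)}
    blocks = {}
    for term, members in annotations.items():
        kept = [g for g in genes if g in members]  # keep original column order
        if kept:
            idx = [column[g] for g in kept]
            blocks[term] = (kept, [[row[j] for j in idx] for row in matrix])
    return blocks


def loo_centroid_accuracy(rows, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in)."""
    correct = 0
    for i, (x, y) in enumerate(zip(rows, labels)):
        centroids = {}
        for cls in set(labels):
            train = [r for k, (r, l) in enumerate(zip(rows, labels))
                     if k != i and l == cls]
            if train:
                centroids[cls] = [sum(col) / len(train) for col in zip(*train)]
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        correct += pred == y
    return correct / len(rows)


# Toy data: 6 samples, 4 genes; G1 and G2 separate the classes, G3 and G4 do not.
genes = ["G1", "G2", "G3", "G4"]
matrix = [
    [1.0, 1.2, 3.0, 2.0],
    [0.8, 1.1, 2.0, 3.0],
    [1.1, 0.9, 3.0, 3.0],
    [5.0, 4.8, 3.0, 2.0],
    [5.2, 5.1, 2.0, 3.0],
    [4.9, 5.0, 3.0, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical prior knowledge: term -> annotated genes (placeholder identifiers).
annotations = {
    "TERM:A": {"G1", "G2"},
    "TERM:B": {"G3", "G4"},
    "TERM:C": {"G9"},  # no measured gene, so it is dropped
}

blocks = split_by_annotation(matrix, genes, annotations)
scores = {term: loo_centroid_accuracy(sub, labels)
          for term, (_, sub) in blocks.items()}

# Terms ranked by how well their genes alone discriminate the two classes.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting the group whose genes differ between the two classes is ranked first, which mirrors the goal of identifying discriminant functional gene groups rather than isolated genes.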
