Gene prioritization through geometric-inspired kernel data fusion

In biology there is often a need to identify, from a large list of candidate genes, the most promising ones for further investigation. While a single data source might not be effective enough, integrating several complementary genomic data sources leads to more accurate predictions. We propose a kernel-based gene prioritization framework using geometric kernel fusion, which we recently developed as a powerful tool for protein fold classification [1]. It has been shown that fusing kernel matrices through more involved geometric means is less sensitive to complementary and noisy kernel matrices than standard multiple kernel learning methods. Since genomic kernels often encode complementary characteristics of biological data, this motivates us to investigate the application of geometric kernel fusion to the gene prioritization task. We use an unbiased, prospective benchmark based on OMIM [2] associations. Experimental results on this benchmark show that our model can improve on the accuracy of the state-of-the-art gene prioritization model.
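To make the idea of fusing kernels through a geometric mean concrete, the following is a minimal sketch of one well-known geometric mean of symmetric positive-definite matrices, the log-Euclidean mean. Whether this is the exact mean used in the proposed framework is an assumption; the function names `_apply_to_spectrum` and `log_euclidean_mean`, the uniform default weights, and the regularization constant `eps` are all hypothetical choices made for illustration.

```python
import numpy as np


def _apply_to_spectrum(K, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    # Symmetrize first to suppress round-off asymmetry.
    w, V = np.linalg.eigh((K + K.T) / 2.0)
    # V @ diag(fun(w)) @ V.T, written with broadcasting.
    return (V * fun(w)) @ V.T


def log_euclidean_mean(kernels, weights=None, eps=1e-8):
    """Fuse kernel matrices as exp(sum_i w_i * log(K_i)).

    Illustrative sketch only: `eps` is an assumed ridge term that keeps
    rank-deficient kernels strictly positive definite before the log.
    """
    n = kernels[0].shape[0]
    if weights is None:
        # Assumption: equal weight for every data source.
        weights = np.full(len(kernels), 1.0 / len(kernels))
    acc = np.zeros((n, n))
    for w, K in zip(weights, kernels):
        K_reg = K + eps * np.eye(n)
        # Clip eigenvalues so the logarithm is always defined.
        acc += w * _apply_to_spectrum(
            K_reg, lambda s: np.log(np.clip(s, eps, None))
        )
    return _apply_to_spectrum(acc, np.exp)
```

Given a list of Gram matrices computed on the same set of genes, one per data source, `log_euclidean_mean(kernels)` returns a single fused kernel of the same size. For commuting kernels this reduces to the ordinary geometric mean of their eigenvalues, which is the sense in which it differs from the weighted arithmetic combination used in standard multiple kernel learning.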

[1] B. Moshiri, et al. Prediction of protein submitochondria locations based on data fusion of various features of sequences, 2011, Journal of Theoretical Biology.

[2] Ethem Alpaydin, et al. Multiple Kernel Learning Algorithms, 2011, J. Mach. Learn. Res.

[3] Nicholas Ayache, et al. Geometric Means in a Novel Vector Space Structure on Symmetric Positive-Definite Matrices, 2007, SIAM J. Matrix Anal. Appl.

[4] Howard Wainer, et al. Estimating Coefficients in Linear Models: It Don't Make No Nevermind, 1976.

[5] Nello Cristianini, et al. A statistical framework for genomic data fusion, 2004, Bioinform.

[6] Jing Chen, et al. Improved human disease candidate gene prioritization using mouse phenotype, 2007, BMC Bioinformatics.

[7] R. Bhatia. Positive Definite Matrices, 2007.

[8] Bart De Moor, et al. An unbiased evaluation of gene prioritization tools, 2012, Bioinform.

[9] Howard L. McLeod, et al. CANDID: a flexible method for prioritizing candidate genes for complex human traits, 2008, Genetic Epidemiology.

[10] Jesse Davis, et al. Beegle: from literature mining to disease gene discovery (poster), 2015.

[11] Y. Moreau, et al. Beegle: from literature mining to disease-gene discovery, 2015, Nucleic Acids Research.

[12] P. Bork, et al. Association of genes to genetically inherited diseases using data mining, 2002, Nature Genetics.

[13] Tijl De Bie, et al. Kernel-based data fusion for gene prioritization, 2007, ISMB/ECCB.

[14] Carol A. Bocchini, et al. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®), 2011, Human Mutation.