Prediction of Drosophila melanogaster gene function using Support Vector Machines

Background: While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes that have no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to such genes. We applied Support Vector Machines (SVMs), a sigmoid fitting function, and a stratified cross-validation approach to a large microarray dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. Approximately 5043 distinct genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (37%) of these genes are un-annotated.

Results: Thirty-nine Gene Ontology Biological Process (GO-BP) categories were predicted with a precision of 0.75 or greater when recall was fixed at 0.4. For two of those categories, we provide additional support for the assignments by showing that the majority of transcripts of the genes assigned to a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions with a confidence score, we were able to provide a putative GO-BP term for 1422 previously un-annotated genes, which is about 77% of the un-annotated genes represented on the microarray and about 19% of all un-annotated genes in the D. melanogaster genome.

Conclusions: Our study employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate predictions of new annotations for D. melanogaster genes. The probabilistic analysis applied to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
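The evaluation criterion quoted in the abstract, precision at a fixed recall of 0.4, can be illustrated with a minimal sketch. This is not the authors' code: the function name `precision_at_recall`, the toy scores, and the toy labels are all hypothetical, and the scores merely stand in for calibrated classifier outputs for a single GO-BP category.

```python
def precision_at_recall(scores, labels, target_recall=0.4):
    """Precision at the highest score threshold whose recall reaches target_recall.

    scores: classifier outputs, higher means more likely in the category.
    labels: 1 if the gene is annotated to the category, else 0.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank genes from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    fp = 0
    i = 0
    while i < len(ranked):
        # Genes tied at the same score share one threshold, so count them together.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / n_pos >= target_recall:
            return tp / (tp + fp)
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical toy data: ten genes, five of them true members of the category.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

print(precision_at_recall(toy_scores, toy_labels, target_recall=0.4))
```

On this toy ranking, recall first reaches 0.4 once the top three genes are accepted, two of which are true members, so the reported precision is 2/3. Under the abstract's criterion a category would need a value of at least 0.75 at this recall level to be counted among the 39.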
