Non-Homology-Based Prediction of Gene Functions

Advances in genomic sequencing and annotation have made identifying genes straightforward, but predicting the functions of these newly identified genes remains challenging. Function is often predicted from homology, because genes descended from a common ancestral sequence are likely to share functions. However, functional annotation errors can propagate from one species to another through this approach. Here we test approaches based on machine learning classification algorithms to predict gene function (specifically, 1,562 GO terms) from non-homology gene features. Performance varied across GO terms, but of the eight supervised classification algorithms evaluated, random forest-based prediction consistently provided the most accurate gene function prediction. Non-homology-based functional annotation provides strengths complementary to those of homology-based annotation: its average performance is higher among Biological Process GO terms, where homology-based functional annotation performs worst, and weaker among Molecular Function GO terms, where the accuracy of homology-based functional annotation is highest. Further improvements in prediction accuracy may be possible by using annotation provenance to generate higher-confidence training datasets and by incorporating more non-homology feature types. Machine learning-based, non-homology functional annotation may ultimately prove useful both as a method to assign predicted functions to orphan genes, which lack functionally characterized homologs, and as a way to identify and correct functional annotation errors propagated through homology-based functional annotation.
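
As an illustration of how performance can vary across GO terms, the following minimal sketch frames prediction as one binary task per GO term and scores each term separately. It is not the authors' pipeline: the gene names, the placeholder GO term labels, the per-term scores, and the choice of AUROC as the metric are all assumptions made for this example. The scores stand in for the output of any supervised classifier (such as a random forest) trained on non-homology features.

```python
def binary_labels(gene_to_terms, go_term):
    """Map each gene to 1 if it is annotated with go_term, else 0."""
    return {gene: int(go_term in terms) for gene, terms in gene_to_terms.items()}


def auroc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5.

    Returns None when a term has no positive or no negative genes.
    """
    pos = [scores[g] for g, y in labels.items() if y == 1]
    neg = [scores[g] for g, y in labels.items() if y == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def per_term_auroc(gene_to_terms, term_scores):
    """Score every GO term independently as its own binary task."""
    return {
        term: auroc(binary_labels(gene_to_terms, term), scores)
        for term, scores in term_scores.items()
    }


# Hypothetical annotations: gene -> set of GO terms (placeholder labels).
annotations = {
    "g1": {"GO:TERM_A"},
    "g2": {"GO:TERM_A", "GO:TERM_B"},
    "g3": {"GO:TERM_B"},
    "g4": set(),
    "g5": {"GO:TERM_A"},
    "g6": {"GO:TERM_B"},
}

# Hypothetical classifier outputs: one score per gene for each GO term.
term_scores = {
    "GO:TERM_A": {"g1": 0.9, "g2": 0.8, "g3": 0.2, "g4": 0.1, "g5": 0.7, "g6": 0.3},
    "GO:TERM_B": {"g1": 0.4, "g2": 0.6, "g3": 0.5, "g4": 0.2, "g5": 0.7, "g6": 0.9},
}

results = per_term_auroc(annotations, term_scores)
for term, value in results.items():
    print(term, round(value, 3))
```

In this toy data the first term is ranked perfectly while the second is not, which mirrors, in a very small way, why per-term evaluation is more informative than a single pooled score.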
