Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach

Machine learning and modeling approaches have been used to classify protein sequences for a broad range of tasks, including predicting protein function, structure, expression, and localization. Some recent studies have successfully predicted whether a given gene is expressed as mRNA, or even potentially translated into protein. However, because not all genes are expressed in every condition and tissue, predicting condition-specific expression remains a challenge. To address this gap, we developed a machine learning approach that predicts tissue-specific gene expression across 23 different tissues in maize based solely on DNA promoter and protein sequences. For class labels, we defined high and low expression levels for mRNA and protein abundance, and we optimized classifiers by systematically exploring various methods and combinations of k-mer sequences in a two-phase approach. In the first phase, we developed Markov model classifiers for each tissue and built a feature vector from their predictions. In the second phase, the feature vector was used as input to a Bayesian network for final classification. Our results show that these methods can achieve classification accuracy of up to 95% when predicting gene expression in individual tissues. By relying on sequence alone, our method works in settings where costly experimental data are unavailable, and it reveals useful insights into the functional, evolutionary, and regulatory characteristics of genes.
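
The two-phase design can be illustrated with a minimal Python sketch. It is not the implementation used in this work; it is a toy under stated assumptions. Phase one is approximated by one fixed-order Markov chain classifier per tissue that compares sequence log-likelihoods under "high" and "low" models. Phase two uses a naive Bayes classifier, the simplest Bayesian network classifier, as a stand-in for the Bayesian network. The class names, the tissue names, the order k=2, the pseudocount, the toy sequences, and the choice of a separate target label for phase two are all hypothetical.

```python
import math
from collections import defaultdict


class MarkovClassifier:
    """Order-k Markov chain classifier: one chain per class (1 = high, 0 = low)."""

    def __init__(self, k=2, alphabet="ACGT", pseudocount=1.0):
        self.k = k
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = {}  # label -> context k-mer -> next symbol -> count

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            table = self.counts.setdefault(
                label, defaultdict(lambda: defaultdict(float)))
            for i in range(self.k, len(seq)):
                table[seq[i - self.k:i]][seq[i]] += 1.0
        return self

    def log_likelihood(self, seq, label):
        table = self.counts[label]
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx = table.get(seq[i - self.k:i], {})
            # Additive smoothing so unseen transitions keep nonzero probability.
            num = ctx.get(seq[i], 0.0) + self.pseudocount
            den = sum(ctx.values()) + self.pseudocount * len(self.alphabet)
            total += math.log(num / den)
        return total

    def predict(self, seq):
        return int(self.log_likelihood(seq, 1) >= self.log_likelihood(seq, 0))


class BernoulliNaiveBayes:
    """Naive Bayes over binary features (stand-in for the Bayesian network)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior, self.theta = {}, {}
        for c in self.classes:
            rows = [x for x, lab in zip(X, y) if lab == c]
            self.prior[c] = len(rows) / len(y)
            # Laplace-smoothed P(feature = 1 | class).
            self.theta[c] = [(sum(col) + 1.0) / (len(rows) + 2.0)
                             for col in zip(*rows)]
        return self

    def predict(self, x):
        def score(c):
            s = math.log(self.prior[c])
            for xi, p in zip(x, self.theta[c]):
                s += math.log(p if xi else 1.0 - p)
            return s
        return max(self.classes, key=score)


# Toy promoter sequences with hypothetical per-tissue high/low labels.
seqs = ["TATATATATATA", "ATATATATATAT", "GCGCGCGCGCGC", "CGCGCGCGCGCG"]
tissue_labels = {"leaf": [1, 1, 0, 0], "root": [0, 0, 1, 1]}
target_labels = [1, 1, 0, 0]  # hypothetical labels for the tissue being predicted

# Phase 1: one Markov classifier per tissue.
phase1 = {t: MarkovClassifier(k=2).fit(seqs, y) for t, y in tissue_labels.items()}


def featurize(seq):
    """Feature vector: one high/low prediction per tissue classifier."""
    return [model.predict(seq) for model in phase1.values()]


# Phase 2: final classification from the vector of phase-1 predictions.
phase2 = BernoulliNaiveBayes().fit([featurize(s) for s in seqs], target_labels)


def predict_expression(seq):
    return phase2.predict(featurize(seq))
```

Calling `predict_expression("TATATATA")` runs a new sequence through both phases. In a realistic setting the phase-one predictions fed to phase two would come from held-out folds rather than from the training sequences themselves, and protein-sequence classifiers would use an amino-acid alphabet.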
