Predicting gene structure changes resulting from genetic variants via exon definition features

Motivation: Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high-quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. That discrepancy in predictive ability stems from assumptions common to reference gene-finding algorithms: that genes are conserved, well‐formed and produce functional proteins.

Results: We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and non‐coding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene‐structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA‐seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or non‐coding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products, and we propose that they may commonly act as cryptic factors in disease.

Availability and implementation: The software is available from geneprediction.org/SGRF.

Supplementary information: Supplementary information is available at Bioinformatics online.
