GIIRA - RNA-Seq driven gene finding incorporating ambiguous reads

MOTIVATION: The reliable identification of genes is a major challenge in genome research, as further analysis depends on the correctness of this initial step. With high-throughput RNA-Seq data reflecting currently expressed genes, a particularly meaningful source of information has become commonly available for gene finding. However, its use in automated gene identification is still not standard practice. A particular challenge in including RNA-Seq data is the handling of ambiguously mapped reads.

RESULTS: We present GIIRA (Gene Identification Incorporating RNA-Seq data and Ambiguous reads), a novel prokaryotic and eukaryotic gene finder that is based exclusively on an RNA-Seq mapping and inherently includes ambiguously mapped reads. GIIRA extracts candidate regions supported by a sufficient number of mappings and reassigns ambiguous reads to their most likely origin using a maximum-flow approach. This avoids the exclusion of genes that are predominantly supported by ambiguous mappings. Evaluation on simulated and real data, and comparison with existing methods incorporating RNA-Seq information, highlight the accuracy of GIIRA in identifying the expressed genes.

AVAILABILITY AND IMPLEMENTATION: GIIRA is implemented in Java and is available from https://sourceforge.net/projects/giira/.
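The idea of reassigning ambiguous reads with a maximum-flow approach can be illustrated with a generic bipartite flow network. The following is a minimal sketch under stated assumptions, not GIIRA's actual formulation: the abstract does not specify the network layout or the capacities, so the function names (`max_flow`, `assign_ambiguous_reads`), the unit capacity per read, and the per-region capacities are all hypothetical. Each read sends one unit of flow to one of the candidate regions it maps to, and each region accepts at most a chosen number of reads.

```python
from collections import defaultdict, deque

SOURCE, SINK = "source", "sink"  # hypothetical node labels


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; returns (flow value, residual graph)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def assign_ambiguous_reads(mappings, region_capacity):
    """Assign each read to one of its candidate regions.

    mappings: read id -> list of region ids the read maps to.
    region_capacity: region id -> maximum number of reads accepted
    (an assumed stand-in for whatever support measure a real tool uses).
    """
    capacity = defaultdict(dict)
    for read, regions in mappings.items():
        capacity[SOURCE][("read", read)] = 1
        for region in regions:
            capacity[("read", read)][("region", region)] = 1
    for region, cap in region_capacity.items():
        capacity[("region", region)][SINK] = cap

    total, residual = max_flow(capacity, SOURCE, SINK)

    assignment = {}
    for read, regions in mappings.items():
        for region in regions:
            # A saturated unit edge means this read was routed to this region.
            if residual[("read", read)][("region", region)] == 0:
                assignment[read] = region
    return total, assignment


if __name__ == "__main__":
    # Toy data: r2 and r3 are ambiguous between regions A and B.
    mappings = {"r1": ["A"], "r2": ["A", "B"], "r3": ["A", "B"], "r4": ["B"]}
    total, assignment = assign_ambiguous_reads(mappings, {"A": 2, "B": 2})
    print(total, assignment)
```

In this toy input the uniquely mapped reads `r1` and `r4` fix one slot in each region, so the two ambiguous reads are split between `A` and `B`. Which of the two goes where is not determined by the flow value alone, which is one reason a real tool would need additional weighting to pick the most likely origin.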
