GIIRA - RNA-Seq driven gene finding incorporating ambiguous reads

MOTIVATION: The reliable identification of genes is a major challenge in genome research, as further analysis depends on the correctness of this initial step. With high-throughput RNA-Seq data reflecting currently expressed genes, a particularly meaningful source of information has become commonly available for gene finding. However, its use in automated gene identification is not yet standard practice. A particular challenge in including RNA-Seq data is the handling of ambiguously mapped reads.

RESULTS: We present GIIRA (Gene Identification Incorporating RNA-Seq data and Ambiguous reads), a novel prokaryotic and eukaryotic gene finder that is based exclusively on an RNA-Seq mapping and inherently includes ambiguously mapped reads. GIIRA extracts candidate regions supported by a sufficient number of mappings and reassigns ambiguous reads to their most likely origin using a maximum-flow approach. This avoids excluding genes that are predominantly supported by ambiguous mappings. Evaluation on simulated and real data, together with a comparison against existing methods that incorporate RNA-Seq information, highlights the accuracy of GIIRA in identifying the expressed genes.

AVAILABILITY AND IMPLEMENTATION: GIIRA is implemented in Java and is available from https://sourceforge.net/projects/giira/.
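The idea of extracting "candidate regions supported by a sufficient number of mappings" can be illustrated with a simple coverage sweep. The sketch below is not GIIRA's implementation: the function name `candidate_regions`, the half-open interval convention, and the `min_coverage` parameter are assumptions made for illustration, and the maximum-flow reassignment of ambiguous reads is not covered.

```python
from typing import Dict, List, Tuple


def candidate_regions(
    reads: List[Tuple[int, int]], min_coverage: int
) -> List[Tuple[int, int]]:
    """Return maximal half-open intervals [start, end) in which at least
    `min_coverage` reads overlap every position.

    `reads` holds mapped read intervals as half-open (start, end) pairs.
    Hypothetical helper for illustration only.
    """
    if min_coverage < 1:
        raise ValueError("min_coverage must be at least 1")

    # Record where coverage rises (read start) and falls (read end).
    deltas: Dict[int, int] = {}
    for start, end in reads:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    regions: List[Tuple[int, int]] = []
    coverage = 0
    region_start = None
    # Sweep positions left to right, tracking the running coverage.
    for pos in sorted(deltas):
        coverage += deltas[pos]
        if coverage >= min_coverage and region_start is None:
            region_start = pos  # coverage just reached the threshold
        elif coverage < min_coverage and region_start is not None:
            regions.append((region_start, pos))  # coverage dropped below it
            region_start = None
    return regions


reads = [(0, 10), (5, 15), (8, 20), (30, 40)]
print(candidate_regions(reads, min_coverage=2))
```

Lowering `min_coverage` to 1 on the same input merges the overlapping reads into one region and keeps the isolated read as a second region; a real gene finder would additionally have to handle spliced alignments and strand information.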

Bernhard Y. Renard | Martin S. Lindner | Franziska Zickmann
