Annotation of the Zebrafish Genome through an Integrated Transcriptomic and Proteomic Analysis

Accurate annotation of protein-coding genes is one of the primary tasks following the completion of whole-genome sequencing of any organism. In this study, we used an integrated transcriptomic and proteomic strategy to validate and improve the existing zebrafish genome annotation. We undertook high-resolution mass-spectrometry-based proteomic profiling of 10 adult organs, the whole adult fish body, and two developmental stages of zebrafish (SAT line), in addition to transcriptomic profiling of six organs. More than 7,000 proteins were identified from the proteomic analyses, and ∼69,000 high-confidence transcripts were assembled from the RNA sequencing data. Approximately 15% of the transcripts mapped to intergenic regions, the majority of which are likely long non-coding RNAs. These high-quality transcriptomic and proteomic data were used to manually reannotate the zebrafish genome. We report the identification of 157 novel protein-coding genes. In addition, our data led to the modification of existing gene structures, including novel exons, changes in exon coordinates, changes in the frame of translation, translation in annotated UTRs, and the joining of genes. Finally, we discovered four instances of genome assembly errors that were supported by both proteomic and transcriptomic data. Our study shows how an integrative analysis of the transcriptome and the proteome can extend our understanding of even well-annotated genomes.
