Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes

MOTIVATION Comparing transcriptomic data with proteomic data to identify protein-coding sequences is a long-standing challenge in molecular biology, one that is exacerbated by the increasing size of high-throughput datasets. To address this challenge, and thereby to improve the quality of genome annotation and understanding of genome biology, we have developed an integrated suite of programs, called Pinstripe. We demonstrate its application, utility and discovery power using transcriptomic and proteomic data from publicly available datasets.

RESULTS To demonstrate the efficacy of Pinstripe for large-scale analysis, we applied Pinstripe's reverse peptide mapping pipeline to the EBI Proteomics Identifications Database (PRIDE) peptide database and to a transcript library that includes de novo assembled transcriptomes from the human Illumina Body Atlas (IBA2) and GENCODE v10 gene annotations. This analysis identified 736 canonical open reading frames (ORFs) that are supported by three or more PRIDE peptide fragments and positioned outside any known coding DNA sequence (CDS). Because the PRIDE database is unfiltered and the probability of false discovery is high, we further refined this list using independent evidence for translation, including the presence of a Kozak sequence or functional domains, synonymous/non-synonymous substitution ratios and ORF length. Using this integrative approach, we observed evidence of translation from a previously unknown let7e primary transcript, the archetypical lncRNA H19, and a homolog of RD3. Reciprocally, by excluding transcripts with mapped peptides or significant ORFs (>80 codons), we identified 32 187 loci with RNAs longer than 2000 nt that are unlikely to encode proteins.

AVAILABILITY AND IMPLEMENTATION Pinstripe (pinstripe.matticklab.com) is freely available as source code or a Mono binary. Pinstripe is written in C# and runs under the Mono framework on Linux or Mac OS X, and under both Mono and .NET on Windows.

CONTACT m.dinger@garvan.org.au or j.mattick@garvan.org.au

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
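
The core idea behind reverse peptide mapping, as summarized in the results, can be illustrated with a minimal Python sketch. This is not Pinstripe's implementation (Pinstripe is written in C#); the function names `find_orfs` and `supported_orfs`, the forward-strand-only scan, and the exact substring matching are assumptions made for illustration. The sketch finds canonical ORFs (ATG to the first in-frame stop codon) in a transcript, translates them with the standard genetic code, and keeps ORFs whose translation contains at least three distinct peptides. It does not check overlap with known CDSs and does not apply the Kozak, domain or substitution-ratio filters.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(transcript, min_codons=1):
    """Return (start, end, protein) for each ATG...stop ORF on the forward strand.

    Hypothetical helper: coordinates are 0-based, end is exclusive and
    includes the stop codon. ORFs lacking an in-frame stop are ignored.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None
        residues = []
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for ambiguous codons
            if start is None:
                if aa == "M":
                    start, residues = i, ["M"]
            elif aa == "*":
                if len(residues) >= min_codons:
                    orfs.append((start, i + 3, "".join(residues)))
                start = None
            else:
                residues.append(aa)
    return orfs


def supported_orfs(transcript, peptides, min_peptides=3, min_codons=1):
    """Keep ORFs whose translation contains >= min_peptides distinct peptides.

    Assumption: a peptide "maps" when it is an exact substring of the
    translated ORF; a real pipeline would need to be more careful.
    """
    hits = []
    for start, end, protein in find_orfs(transcript, min_codons):
        matched = sorted({p for p in peptides if p in protein})
        if len(matched) >= min_peptides:
            hits.append((start, end, protein, matched))
    return hits
```

Calling `supported_orfs(transcript, peptides)` on a transcript string and a list of peptide strings returns the ORFs that meet the three-peptide threshold described above; raising `min_codons` approximates an ORF-length filter.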