Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy.

Background: Small open reading frames (sORFs) with protein-coding ability present an unprecedented challenge for genome annotation because of their short sequences and low expression levels. In the past decade, only a few prediction methods have been proposed for discovering protein-coding sORFs, and the lack of objective and uniform negative datasets has become an important obstacle to sORF prediction. The performance of current sORF prediction methods needs further evaluation to guide better research strategies for protein-coding sORF discovery. Methods: In this work, nine mainstream methods for predicting the protein-coding potential of ORFs are comprehensively evaluated based on a random sequence strategy. Results: The results show that the current methods perform poorly on different sORF datasets. For comparison, a sequence-based prediction algorithm trained on prokaryotic sORFs is proposed, and its better prediction performance indicates that the random sequence strategy can provide a feasible approach to protein-coding sORF prediction. Conclusions: Protein-coding sORFs are an important class of functional genomic elements, and their discovery has shed light on the dark proteome. This evaluation indicates an urgent need to develop specialized prediction tools for protein-coding sORFs in both eukaryotes and prokaryotes. The present work may provide novel ideas for future sORF research.
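The abstract does not describe how the random sequences are built, so the following Python sketch shows only one plausible reading of a random sequence strategy and is not the authors' procedure. It assumes that each negative example is made by shuffling the interior nucleotides of a known sORF while keeping its start and stop codons and rejecting shuffles that introduce an in-frame stop codon. The function names (`random_negative`, `has_internal_stop`, `sensitivity_specificity`) and the toy sequence are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_internal_stop(seq):
    """Return True if an in-frame stop codon occurs between the first and last codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(3, len(seq) - 3, 3))


def random_negative(sorf, rng, max_tries=1000):
    """Shuffle the interior of an sORF, keeping its start and stop codons.

    Assumption: the negative keeps the length and nucleotide composition
    of the positive, so a predictor cannot separate them on those alone.
    """
    start, body, stop = sorf[:3], list(sorf[3:-3]), sorf[-3:]
    for _ in range(max_tries):
        rng.shuffle(body)
        candidate = start + "".join(body) + stop
        if not has_internal_stop(candidate):
            return candidate
    raise ValueError("no stop-free shuffle found")


def sensitivity_specificity(predict, positives, negatives):
    """Score a predictor: predict(seq) returns True for 'coding'."""
    true_pos = sum(1 for s in positives if predict(s))
    true_neg = sum(1 for s in negatives if not predict(s))
    return true_pos / len(positives), true_neg / len(negatives)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    positives = ["ATGGCTAAAGGTTAA"]  # toy sORF, not real data
    negatives = [random_negative(s, rng) for s in positives]
    # A predictor that calls everything coding has perfect sensitivity
    # and zero specificity, which is what a random negative set exposes.
    print(sensitivity_specificity(lambda s: True, positives, negatives))
```

Under these assumptions, any predictor wrapped as a function from sequence to boolean can be scored against the same positives and their shuffled counterparts, which gives every tool a uniform negative set.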
