Composition bias and the origin of ORFan genes

Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome.

Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time.

Contact: ron@biocoml.ls.biu.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.
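The abstract describes composition bias only in words, as the deviation of a sequence's amino acid composition from the genome-wide average. The following is a minimal sketch of one way such a measure could be computed, not the authors' implementation. It assumes that the deviation is a Euclidean distance between 20-dimensional frequency vectors and that the genome average is the mean of per-protein compositions; the abstract specifies neither choice, and the function names `composition`, `genome_average` and `composition_bias` are hypothetical.

```python
from collections import Counter
import math

# The 20 standard amino acids, in a fixed order for the frequency vectors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the amino acid frequencies of seq as a 20-element list."""
    # Ignore any character that is not a standard amino acid (e.g. X or *).
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def genome_average(proteins):
    """Average composition over all proteins of a genome.

    Assumption: the average is the unweighted mean of per-protein
    compositions; pooling all residues would be an alternative.
    """
    comps = [composition(p) for p in proteins]
    return [sum(column) / len(comps) for column in zip(*comps)]


def composition_bias(seq, average):
    """Deviation of seq's composition from the genome average.

    Assumption: Euclidean distance is used as the deviation; the
    abstract does not state which distance the authors chose.
    """
    return math.sqrt(
        sum((f - a) ** 2 for f, a in zip(composition(seq), average))
    )
```

Under these assumptions, a sequence whose composition matches the genome average has a bias of zero, and the value grows as the composition departs from it. The same function could be applied to real proteins, random ‘proteins’ and translated intergenic regions to compare the three distributions.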
