A novel substitution matrix fitted to the compositional bias in Mollicutes improves the prediction of homologous relationships

Background: Substitution matrices are key parameters for the alignment of two protein sequences, and consequently for most comparative genomics studies. The composition of biological sequences can vary substantially between species and groups of species, and classical matrices such as those in the BLOSUM series fail to accurately estimate alignment scores and statistical significance for sequences that share a marked compositional bias.

Results: We present a general and simple methodology for building matrices that are specifically fitted to the compositional bias of proteins. Our approach is inspired by the one used to build the BLOSUM matrices and is based on learning substitution and amino acid frequencies from real sequences with the corresponding compositional bias. We applied it to the large-scale comparison of AT-rich Mollicute genomes. The new matrix, MOLLI60, was used to predict pairwise orthology relationships, as well as homolog families, among 24 Mollicute genomes. We show that, compared with the classical BLOSUM62 matrix, this new matrix discriminates better between true and false orthologs and improves the clustering of homologous proteins.

Conclusions: We show in this paper that well-fitted matrices can improve the prediction of orthologous and homologous relationships among proteins with a similar compositional bias. With the ever-increasing number of sequenced genomes, our approach could prove valuable in numerous comparative studies focusing on atypical genomes.
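To make the BLOSUM-style construction concrete, the following is a minimal sketch of how log-odds scores can be derived from substitution and amino acid frequencies counted in ungapped alignment blocks. It follows the general log-odds recipe of the BLOSUM series, not necessarily the exact MOLLI60 procedure: the function name `log_odds_matrix`, the toy input, and the half-bit scaling are assumptions made for illustration, and the sequence clustering step used for BLOSUM is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(blocks, scale=2.0):
    """Derive log-odds substitution scores from ungapped alignment blocks.

    blocks: list of blocks, each a list of equal-length aligned sequences.
    scale:  multiplier applied to log2 odds (2.0 gives half-bit units).
    Returns (scores, background), where scores maps (a, b) to an integer
    score and background maps each residue to its estimated frequency.
    """
    # Count every unordered residue pair observed within each column.
    pair_counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of a residue: p_a = q_aa + sum_{b != a} q_ab / 2.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    # Score = scale * log2(observed / expected), rounded to an integer.
    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        score = round(scale * math.log2(freq / expected))
        scores[(a, b)] = score
        scores[(b, a)] = score
    return scores, dict(background)


# Hypothetical toy block; real input would be blocks of biased proteins.
toy_blocks = [["AK", "AK", "AR"]]
scores, background = log_odds_matrix(toy_blocks)
```

Because both the pair frequencies and the background frequencies are learned from the same biased sequences, the expected term in the ratio reflects the composition of the genomes of interest rather than that of a general protein database. Pairs never observed in the input are simply absent from the returned dictionary; a real implementation would need pseudocounts or a much larger training set.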
