The Relation between Indel Length and Functional Divergence: A Formal Study

Although insertions and deletions (indels) are a common type of evolutionary sequence variation, their origins and functional consequences are not yet comprehensively understood. There is evidence that, on the one hand, classical alignment procedures only roughly reflect the underlying evolutionary processes and, on the other hand, that indels cause structural changes in the proteins' surfaces. We first demonstrate how to identify alignment gaps that have been introduced by evolution to a statistically significant degree, by means of a novel, sound statistical framework based on pair hidden Markov models (HMMs). Second, we examine paralogous protein pairs in E. coli, obtained by computing classical global alignments. Distinguishing between indel and non-indel pairs according to our novel statistics reveals that, at the same sequence identity, indel pairs are significantly less functionally similar than non-indel pairs, as measured by recently suggested GO-based functional distances. This suggests that indels cause more severe functional changes than other types of sequence variation, and that indel statistics should additionally be taken into account when assessing functional similarity between paralogous protein pairs.
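To make the notion of a statistically significant gap concrete, the following minimal sketch computes the probability that a run of at least k consecutive gap positions appears among n alignment columns under a deliberately simplified null model. It is not the pair HMM framework described above: the function name `prob_run_at_least`, the i.i.d. per-column gap probability `p`, and the example numbers are all assumptions chosen for illustration only.

```python
def prob_run_at_least(n, k, p):
    """Probability that n independent columns, each a gap with
    probability p (hypothetical null model), contain at least one
    run of k or more consecutive gap columns."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # state[j]: probability that no run of length k has occurred yet
    # and the current trailing run of gaps has length j (0 <= j < k).
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new_state = [0.0] * k
        # A non-gap column resets the trailing run to zero.
        new_state[0] = sum(state) * (1.0 - p)
        # A gap column extends the trailing run by one; runs that would
        # reach length k leave the "no run yet" states and are dropped.
        for j in range(k - 1):
            new_state[j + 1] = state[j] * p
        state = new_state
    # Everything not in a "no run yet" state has seen a run of length k.
    return 1.0 - sum(state)


if __name__ == "__main__":
    # Illustrative values only: 300 columns, 5% gap probability per column.
    for k in (2, 4, 6):
        print(k, prob_run_at_least(300, k, 0.05))
```

Under this toy null model, a small probability for an observed gap length would indicate that the gap is unlikely to have arisen by chance; a pair HMM null would instead account for dependencies between neighboring alignment columns.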
