Percent sequence identity; the need to be explicit.

Similarity between a pair of aligned biological sequences is represented as sequence identity: the number of aligned positions where the matching characters (e.g., amino acids in proteins) are identical. An evolutionary relationship between a pair of sequences is usually inferred on the basis of high sequence identity between them. Also, because proteins in the same homologous family share a common fold, sequence identity can be used to identify proteins with similar three-dimensional structures. This approach, known as comparative protein modeling (Blundell et al., 1987), is currently by far the most accurate method for predicting the 3D structure of a protein from its sequence. The quality of models built by the comparative approach depends primarily on the sequence identity between the sequence to be modeled (target) and the parent or structural template(s) (Baker and Sali, 2001).

Since not all protein sequences are of the same length, it is useful to correct sequence identity for length and express it as percent identity (PID). Given that the concept of PID is at the heart of comparative biological sequence analysis, it is somewhat surprising that there is no consensus on the appropriate choice of denominator for normalization of sequence identity.
I am aware of the use of four denominators:
1. Length of the shorter sequence (L1).
2. Number of aligned positions, i.e., alignment length (includes gaps, if any) (L2).
3. Number of aligned residue pairs, i.e., identities and nonidentities (excludes gaps, if any) (L3).
4. Arithmetic mean sequence length (L4).

Clearly, PID is operationally defined. Unfortunately, PIDs are almost always quoted without the choice of denominator made explicit. Of course, we can simply say it is sloppy to do so, but is it more than that?

One way to answer that question is to examine the behavior of the four denominators for reliable 3D structure-based sequence alignments of protein homologous families. Here I used the December 1, 2003, release of HOMSTRAD (Mizuguchi et al., 1998), which comprises 1032 families. For those families consisting of >2 structures, I considered all possible pairwise sequence alignments as defined by the HOMSTRAD 3D structure-based multiple sequence alignment. (For each pair, I ignored any matching gap characters defined by the multiple sequence alignment.)

I used the linear correlation coefficient to describe the relationships between the four denominators for all sequence pairs (N = 9539) (Table 1): the most similar pair of denominators is L1 and L4, while the least similar pair is L2 and L3. The latter result indicates that the number of gaps within an alignment is an important statistic in its own right. L4 is the representative denominator: it is most similar to the three others.

Table 1. Similarity Matrix of Four PID Denominators (L1–L4)
      L1     L2     L3     L4
L1    1      0.978  0.99   0.996
L2    0.978  1      0.949  0.99
L3    0.99   0.949  1      0.984
L4    0.996  0.99   0.984  1
Each element is the linear correlation coefficient (1032 protein homologous families, 9539 sequence pairs).
The four PID denominators (L1–L4) are defined in the text.

The correlation coefficient summarizes in a single number the relationship between each pair of denominators across all 9539 sequence pairs. However, it is also useful to consider the spread of the four denominators within each pair. So, I used the coefficient of variation (CV), defined as the standard deviation expressed as a percentage of the mean, to examine the relative spread of the four denominators for each sequence pair. 75.5% of the sequence pairs have CV of the four denominators 40%, while the largest CV is 66.8%.

PID is a key concept for classification of proteins. By definition, classification of objects requires an operational definition of similarity. There is a long history of the use of sharp PID cutoffs to define family and superfamily membership (for a review, see Doolittle, 1981). Generally, proteins belonging to the same homologous family (i.e., proteins with a clear evolutionary relationship) can be aligned to produce a PID ≥30 (Murzin et al., 1995). Members of a protein family have a common fold and usually a common function. A protein superfamily is defined as the union of ≥2 families, not all of whose members can be aligned to produce a PID ≥30 with all the other members of each family (Doolittle, 1981).
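
To make the effect of the denominator on a sharp cutoff concrete, here is a minimal Python sketch that computes PID with each of the four denominators for one pairwise alignment and compares the results with a cutoff of 30. The gap character, the toy alignment, and all names (`count_alignment`, `percent_identities`, `CUTOFF`) are assumptions made for illustration and are not taken from HOMSTRAD or from any published tool. Columns in which both rows hold a gap are skipped, mirroring the treatment of matching gap characters described above.

```python
from dataclasses import dataclass

GAP = "-"    # assumed gap character
CUTOFF = 30  # the PID >= 30 family-level cutoff discussed in the text


@dataclass(frozen=True)
class AlignmentCounts:
    identities: int
    shorter_length: int    # L1: length of the shorter sequence
    alignment_length: int  # L2: aligned positions, gaps included
    aligned_pairs: int     # L3: residue pairs, gaps excluded
    mean_length: float     # L4: arithmetic mean sequence length


def count_alignment(row_a: str, row_b: str) -> AlignmentCounts:
    """Count identities and the four denominators for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    identities = aligned_pairs = alignment_length = 0
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue  # matching gaps left over from a multiple alignment
        alignment_length += 1
        if a != GAP and b != GAP:
            aligned_pairs += 1
            if a == b:
                identities += 1
    len_a = len(row_a) - row_a.count(GAP)
    len_b = len(row_b) - row_b.count(GAP)
    return AlignmentCounts(
        identities=identities,
        shorter_length=min(len_a, len_b),
        alignment_length=alignment_length,
        aligned_pairs=aligned_pairs,
        mean_length=(len_a + len_b) / 2,
    )


def percent_identities(counts: AlignmentCounts) -> dict:
    """Return PID under each denominator (assumes every denominator is non-zero)."""
    denominators = {
        "L1": counts.shorter_length,
        "L2": counts.alignment_length,
        "L3": counts.aligned_pairs,
        "L4": counts.mean_length,
    }
    return {name: 100 * counts.identities / d for name, d in denominators.items()}


# Invented toy alignment: 12 and 8 residues, 8 aligned pairs, 3 identities.
row_a = "ACDEFGHIKLMN"
row_b = "ACD--QRS--TV"
pids = percent_identities(count_alignment(row_a, row_b))
for name, pid in pids.items():
    verdict = "meets" if pid >= CUTOFF else "misses"
    print(f"{name}: PID = {pid:.1f} ({verdict} the cutoff of {CUTOFF})")
```

For this toy pair the same three identities give a PID of 37.5 with L1 and L3, 30.0 with L4, and 25.0 with L2, so whether the pair meets the cutoff depends on which denominator was used.
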
In the absence of “significant” PID, the probable common evolutionary origin of proteins within a superfamily must be inferred on the basis of shared structural and functional features (Murzin et al., 1995). Evolutionarily related sequences for which the PID after alignment is below the threshold level for inference of common ancestry are said to fall in the “twilight zone” (Doolittle, 1981).

Given the central role of PID cutoffs in the classification and modeling of proteins, it is important that the choice of denominator is clear. It is not difficult to think of arguments for the choice of any of the four denominators, which is why I have not rehearsed them here. However, I show that L4, the arithmetic mean sequence length, is the best choice for the protein homologous families in HOMSTRAD on the basis of its maximal similarity to the three others. Of course, the relationships in Table 1 might not hold for other datasets. For instance, L4 might not be appropriate for alignments between very short sequences and long ones; the CV data (above) show that HOMSTRAD does not contain many such pairs.
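
The point about very short and long sequences can be illustrated with a small sketch of the CV calculation. It derives the four denominators from the two sequence lengths and the number of aligned residue pairs, using the fact that, once columns with matching gaps are dropped, every alignment column holds either a residue pair or a single residue opposite a gap, so that L2 = length A + length B - L3. The text does not say whether the sample or the population standard deviation was used; the sample form (`statistics.stdev`) is an assumption here, as are the function names and the two invented length combinations, which are not HOMSTRAD data.

```python
from statistics import mean, stdev


def four_denominators(len_a: int, len_b: int, aligned_pairs: int) -> list:
    """Return [L1, L2, L3, L4] from two sequence lengths and the aligned-pair count."""
    if not 0 < aligned_pairs <= min(len_a, len_b):
        raise ValueError("aligned_pairs must be between 1 and the shorter length")
    shorter_length = min(len_a, len_b)                # L1
    alignment_length = len_a + len_b - aligned_pairs  # L2: pairs plus gapped columns
    mean_length = (len_a + len_b) / 2                 # L4
    return [shorter_length, alignment_length, aligned_pairs, mean_length]


def coefficient_of_variation(values: list) -> float:
    """Standard deviation as a percentage of the mean (sample form, an assumption)."""
    return 100 * stdev(values) / mean(values)


# Two invented pairs: similar lengths versus a short sequence aligned to a long one.
similar = four_denominators(100, 104, 98)
unequal = four_denominators(30, 300, 28)
cv_similar = coefficient_of_variation(similar)
cv_unequal = coefficient_of_variation(unequal)

print(f"similar lengths: denominators = {similar}, CV = {cv_similar:.1f}%")
print(f"unequal lengths: denominators = {unequal}, CV = {cv_unequal:.1f}%")
```

With similar lengths the four denominators nearly coincide and the CV stays small. When a short sequence is aligned to a much longer one, L1 and L3 stay near the short length while L2 approaches the long one, and L4 sits between them, so the CV grows large and the choice of denominator changes the reported PID substantially.
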