A comprehensive representation of extensive similarity linkage between large numbers of proteins

A method is described for representing a bird's-eye view of similarity relationships among large numbers of proteins. Using single-linkage clustering, proteins are grouped on the basis of various types of similarity, such as sequence similarity, estimated for all protein pairs. Each protein in a group is connected, directly or indirectly, to every other protein in the same group by similarities higher than a given threshold, and shows no similarity higher than the threshold to any protein outside the group. Thus, all the proteins directly or indirectly related to a given protein can be selected from a large number of proteins by the clustering. Applying this clustering recursively within each group leads to further classification of the proteins. The similarity relationships in each group are represented visually by a similarity matrix. This representation makes it easy to detect multidomain proteins and diverged families as well as closely related proteins. Such an exhaustive approach to similarity relationships of proteins will be useful for revealing functional, structural, and evolutionary units in proteins.
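The grouping rule stated in the abstract can be illustrated with a minimal sketch. It assumes that pairwise similarities are already available as a dictionary keyed by protein pairs, and it finds the groups with a union-find structure, which may differ from the authors' implementation. The function name `single_linkage_groups`, the protein labels, and all scores are hypothetical and invented for illustration.

```python
from collections import defaultdict


def single_linkage_groups(names, similarity, threshold):
    """Group items so that members are linked, directly or indirectly,
    by pairwise similarities strictly higher than `threshold`."""
    # Each item starts as the root of its own group.
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every pair whose similarity exceeds the threshold.
    for (a, b), score in similarity.items():
        if score > threshold:
            parent[find(a)] = find(b)

    # Collect members by root; sort for a deterministic result.
    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical similarity scores, not taken from the paper.
names = ["P1", "P2", "P3", "P4", "P5"]
similarity = {
    ("P1", "P2"): 0.9,
    ("P2", "P3"): 0.8,
    ("P1", "P3"): 0.1,
    ("P4", "P5"): 0.7,
    ("P3", "P4"): 0.2,
}

groups = single_linkage_groups(names, similarity, threshold=0.5)
print(groups)  # [['P1', 'P2', 'P3'], ['P4', 'P5']]
```

In this toy input, P1 and P3 fall into the same group only through P2, which is the indirect connection the abstract describes, and no pair spanning the two groups scores above the threshold.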