PSIC: profile extraction from sequence alignments with position-specific counts of independent observations.

Sequence weighting techniques aim to balance the redundant information contributed by subsets of similar sequences in multiple alignments. Traditional approaches apply the same weight to all positions of a given sequence and thereby assume that phylogenetic changes are equally efficient along the whole sequence. The new method described in this paper, PSIC (position-specific independent counts), does not require this restrictive assumption. Using statistical concepts, it calculates the number of independent observations (counts) of an amino acid type at a given alignment position from the overall similarity of the sequences that share that amino acid type at this position. This approach allows fast computation of position-specific sequence weights even for alignments containing hundreds of sequences. The PSIC approach has been applied to profile extraction and to the fold family assignment of protein sequences with known structures. In many cases, the method proved very productive in finding distantly related sequences and more powerful than Hidden Markov Models or the profile methods in WiseTools and PSI-BLAST. The profile extraction routine is available on the WWW (http://www.bork.embl-heidelberg.de/PSIC or http://www.imb.ac.ru/PSIC).
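
The position-specific grouping described above can be illustrated with a minimal sketch. It only computes the input quantity the abstract names: for one alignment column, the sequences sharing each amino acid type and their overall similarity, taken here as mean pairwise fraction of identical positions. It does not reproduce the statistical step that converts this similarity into a number of independent counts, which the abstract does not specify. The function names, the gap character "-", and the choice of mean pairwise identity as the similarity measure are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def fraction_identical(a, b):
    """Fraction of aligned columns with the same residue (gap-gap columns skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def column_group_similarity(alignment, column):
    """Group sequences by their residue at `column`.

    Returns {residue: (n_sequences, mean_pairwise_identity)}; the similarity
    is None for a residue observed in a single sequence. This is only the
    raw input to a PSIC-style count, not the count itself (assumption).
    """
    groups = defaultdict(list)
    for seq in alignment:
        residue = seq[column]
        if residue != "-":  # gaps are not amino acid observations
            groups[residue].append(seq)

    result = {}
    for residue, members in groups.items():
        if len(members) < 2:
            result[residue] = (len(members), None)
        else:
            sims = [fraction_identical(a, b) for a, b in combinations(members, 2)]
            result[residue] = (len(members), sum(sims) / len(sims))
    return result


# Toy alignment (hypothetical data): two identical sequences and one
# divergent sequence share "A" in column 0.
toy = ["ACDE", "ACDE", "AGHK", "TCDE"]
stats = column_group_similarity(toy, 0)
```

In this toy case the three sequences carrying "A" in the first column include two identical copies, so a similarity-aware scheme would treat them as fewer than three independent observations, whereas a per-sequence weight would apply the same correction at every column.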
