Limits of homology detection by pairwise sequence comparison

MOTIVATION Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments. RESULTS We have investigated noise levels in pairwise alignment based database searches. The noise levels of 38 releases of the SwissProt database, display perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.

[1]  Martin Vingron,et al.  A set-theoretic approach to database searching and clustering , 1998, Bioinform..

[2]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[3]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[4]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Martin Vingron,et al.  Sequence Comparison Significance and Poisson Approximation , 1994 .

[6]  W. Taylor,et al.  Identification of protein sequence homology by consensus template alignment. , 1986, Journal of molecular biology.

[7]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[8]  A. Dembo,et al.  Limit Distribution of Maximal Non-Aligned Two-Sequence Segmental Score , 1994 .

[9]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[10]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[11]  P. Bucher,et al.  Improving the sensitivity of the sequence profile method , 1994, Protein science : a publication of the Protein Society.

[12]  C. Chothia,et al.  Intermediate sequences increase the detection of homology between sequences. , 1997, Journal of molecular biology.

[13]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[14]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[15]  Martin Vingron,et al.  Statistics of large scale sequence searching , 1997, German Conference on Bioinformatics.

[16]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[17]  D. Lipman,et al.  Rapid and sensitive protein similarity searches. , 1985, Science.

[18]  C. Chothia,et al.  Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 , 2000, Nucleic Acids Res..

[20]  A. Rényi,et al.  On a new law of large numbers , 1970 .

[21]  Sean R. Eddy,et al.  Maximum Discrimination Hidden Markov Models of Sequence Consensus , 1995, J. Comput. Biol..

[22]  S. Altschul,et al.  Issues in searching molecular sequence databases , 1994, Nature Genetics.

[23]  A. Dembo,et al.  Critical Phenomena for Sequence Matching with Scoring , 1994 .

[24]  Lours,et al.  An Extreme Value Theory for Sequence Matching , 2022 .

[25]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence data bank and its supplement TrEMBL , 1997, Nucleic Acids Res..

[26]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.