Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches

MOTIVATION The deluge of biological information from different genomic initiatives and the rapid advancement in biotechnologies have made bioinformatics tools an integral part of modern biology. Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an iterative profile position specific score matrix (PSSM)-based search strategy, is more sensitive than BLAST in detecting weak homologies, thus making it suitable for remote homolog detection. Many refinements have been made to improve PSI-BLAST, and its computational efficiency and high specificity have been much touted. Nevertheless, corruption of its profile via the incorporation of false positive sequences remains a major challenge. RESULTS We have developed a simple and elegant approach to resolve the problem of model corruption in PSI-BLAST searches. We hypothesized that combining results from the first (least-corrupted) profile with results from later (most sensitive) iterations of PSI-BLAST provides a better discriminator for true and false hits. Accordingly, we have derived a formula that utilizes the E-values from these two PSI-BLAST iterations to obtain a figure of merit for rank-ordering the hits. Our verification results based on a 'gold-standard' test set indicate that this figure of merit does indeed delineate true positives from false positives better than PSI-BLAST E-values. Perhaps what is most notable about this strategy is that it is simple and straightforward to implement.

[1]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[2]  M S Waterman,et al.  Sequence alignment and penalty choice. Review of concepts, case studies and implications. , 1994, Journal of molecular biology.

[3]  Richard Mott,et al.  Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores , 1992 .

[4]  Thomas L. Madden,et al.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. , 2001, Nucleic acids research.

[5]  S. Eddy Hidden Markov models. , 1996, Current opinion in structural biology.

[6]  Tiffani J. Bright,et al.  PubMatrix: a tool for multiplex literature mining , 2003, BMC Bioinformatics.

[7]  Temple F. Smith,et al.  The statistical distribution of nucleic acid similarities. , 1985, Nucleic acids research.

[8]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[9]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[10]  S. Karlin,et al.  Applications and statistics for multiple high-scoring segments in molecular sequences. , 1993, Proceedings of the National Academy of Sciences of the United States of America.

[11]  R. Bundschuh,et al.  Distant homology detection using a LEngth and STructure‐based sequence Alignment Tool (LESTAT) , 2007, Proteins.

[12]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[13]  Michal Linial,et al.  PANDORA: keyword-based analysis of protein sets by integration of annotation sources. , 2003, Nucleic acids research.

[14]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[15]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[16]  Amir Dembo,et al.  LIMIT DISTRIBUTIONS OF MAXIMAL SEGMENTAL SCORE AMONG MARKOV-DEPENDENT PARTIAL SUMS , 1992 .

[17]  J U Bowie,et al.  Three-dimensional profiles for measuring compatibility of amino acid sequence with three-dimensional structure. , 1996, Methods in enzymology.

[18]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[19]  Georg Fuellen,et al.  Comparative homology agreement search: an effective combination of homology-search methods. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[20]  William Noble Grundy,et al.  Homology Detection via Family Pairwise Search , 1998, J. Comput. Biol..

[21]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[22]  S F Altschul,et al.  Local alignment statistics. , 1996, Methods in enzymology.

[23]  Hagit Shatkay,et al.  SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. , 2007, Bioinformatics.

[24]  M S Waterman,et al.  Rapid and accurate estimates of statistical significance for sequence data base searches. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[25]  E. Koonin,et al.  Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. , 1999, Journal of molecular biology.

[26]  J. F. Collins,et al.  The significance of protein sequence similarities , 1988, Comput. Appl. Biosci..

[27]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Alejandro A. Schäffer,et al.  A structure-based method for protein sequence alignment , 2005, Bioinform..

[29]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[30]  Temple F. Smith,et al.  Comparison of biosequences , 1981 .