Pharmacophore alignment search tool: Influence of canonical atom labeling on similarity searching

Previously (Hähnke et al., J Comput Chem 2009, 30, 761), we presented the Pharmacophore Alignment Search Tool (PhAST), a ligand‐based virtual screening technique that represents molecules as strings encoding pharmacophoric features and compares them by global pairwise sequence alignment. To make the reduction of two‐dimensional molecular graphs to one‐dimensional strings unambiguous, PhAST employs a graph canonization step. Here, we compare 11 different graph canonization algorithms with respect to their impact on virtual screening. Retrospective screenings of a drug‐like data set were evaluated using the BEDROC metric, which yielded averaged values ranging from 0.4 for the best‐performing to 0.14 for the worst‐performing canonization technique. We compared five scoring schemes for the alignments and found preferred combinations of canonization algorithms and scoring functions. Finally, we introduce a performance index that helps prioritize canonization approaches without the need for extensive retrospective evaluation. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010
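
To illustrate the comparison step described above, the following is a minimal sketch of a global pairwise alignment score (Needleman-Wunsch style) between two feature strings. It is not the PhAST implementation: the one-letter symbols (A, D, H), the function name global_alignment_score, and the match, mismatch, and gap values are assumptions chosen purely for illustration, and a simple linear gap penalty stands in for whatever scoring schemes the study actually evaluated.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Hypothetical scoring: +1 per identical symbol, -1 per differing
    symbol, -2 per gap position (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the dynamic-programming table row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# Hypothetical feature strings: A = acceptor, D = donor, H = hydrophobic.
query = "ADHA"
candidate = "ADA"
print(global_alignment_score(query, candidate))
```

Because a global alignment spans both strings from end to end, the score depends on the order in which atoms are written into the string, which is why the canonization step matters for the resulting similarity ranking.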
