Statistical Analysis of Unlabeled Point Sets: Comparing Molecules in Chemoinformatics

We consider Bayesian methodology for comparing two or more unlabeled point sets. An application to a set of steroid molecules illustrates the potential of the technique for comparing molecules in chemoinformatics and bioinformatics. We begin by matching a pair of molecules, where one molecule is regarded as random and the other as fixed. We propose a type of mixture model for the point set coordinates; its parameters are a labeling matrix (indicating which pairs of points match) and a concentration parameter. An important property of the likelihood is that it is invariant under rotations and translations of the data. Bayesian inference for the parameters is carried out using Markov chain Monte Carlo simulation, and we demonstrate that the procedure works well on the steroid data. Because the posterior distribution has multiple local modes, it is difficult to simulate from, and we use additional data (partial charges on atoms) to help with this task. We also consider an approximation that speeds up the simulation algorithm; for our data, the fast approximate algorithm leads to essentially identical inference to the exact method. We then introduce extensions to multiple molecule alignment and describe an algorithm that also works well on the steroid data set. After all the steroid molecules have been matched, exploratory data analysis is carried out to examine which molecules are similar. Finally, we consider further Bayesian inference for the multiple alignment problem.
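To make the roles of the labeling matrix and of rotation and translation invariance concrete, the following minimal sketch computes a residual sum of squares between the matched points of two point sets after optimal rigid-body superposition. It is an illustration only, not the likelihood of the paper: it uses the standard Kabsch (SVD-based Procrustes) solution, and the function name `matched_procrustes_rss` and the argument names `x`, `mu`, and `labels` are assumptions introduced here.

```python
import numpy as np


def matched_procrustes_rss(x, mu, labels):
    """Residual sum of squares between matched points after rigid alignment.

    x      : (n, 3) array, coordinates of the "random" point set (assumed name).
    mu     : (m, 3) array, coordinates of the "fixed" point set (assumed name).
    labels : (n, m) 0/1 array; labels[i, j] = 1 means point i of x is matched
             to point j of mu. At most one 1 per row and per column is assumed.
    """
    rows, cols = np.nonzero(labels)
    a = x[rows]   # matched points from x
    b = mu[cols]  # their partners in mu

    # Centering removes the effect of translation.
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)

    # Kabsch solution: rotation r minimizing ||a_c @ r - b_c||^2.
    u, _, vt = np.linalg.svd(a_c.T @ b_c)
    d = np.sign(np.linalg.det(u @ vt))       # guard against reflections
    r = u @ np.diag([1.0, 1.0, d]) @ vt

    resid = a_c @ r - b_c
    return float(np.sum(resid ** 2))
```

Because the value depends on the data only through the centered, optimally rotated matched points, it is unchanged if either point set is rotated or translated. A small value for a given labeling matrix indicates that the proposed matching is geometrically plausible, so a quantity of this kind is a natural ingredient when scoring candidate labelings, under the assumptions stated above.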
