Bayesian alignment using hierarchical models, with applications in protein bioinformatics

An important problem in shape analysis is to match configurations of points in space after filtering out some geometrical transformation. In this paper we introduce hierarchical models for such tasks, in which the points in the configurations are either unlabelled or have at most a partial labelling constraining the matching, and in which some points may appear in only one of the configurations. We derive procedures for simultaneous inference about the matching and the transformation, using a Bayesian approach. Our hierarchical model is based on a Poisson process for hidden true point locations; this leads to considerable mathematical simplification and to efficient implementation of EM and Markov chain Monte Carlo algorithms. We find a novel use for classical distributions from directional statistics in a conditionally conjugate specification for the case where the geometrical transformation includes an unknown rotation. Throughout, we focus on the case of affine or rigid motion transformations. Under a broad parametric family of loss functions, an optimal Bayesian point estimate of the matching matrix can be constructed that depends only on a single parameter of the family. Our methods are illustrated by two applications from bioinformatics. The first is the matching of protein gels in two dimensions, and the second is the alignment of active sites of proteins in three dimensions. In the latter case, we also use information related to the grouping of the amino acids, as an example of a more general capability of our methodology to include partial labelling information. We discuss some open problems and suggest directions for future work. Copyright 2006, Oxford University Press.
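
To make the point estimate of the matching matrix concrete, the sketch below shows one way such an estimate could be computed from posterior matching probabilities. It is an illustration under stated assumptions, not the paper's implementation: it assumes the posterior probabilities `probs[j][k]` that point j in one configuration matches point k in the other are already available (for example as MCMC frequencies), and it assumes the single loss parameter acts as a threshold, so that the estimate maximizes the sum of `probs[j][k] - threshold` over declared matches. The names `best_matching`, `probs` and `threshold` are hypothetical.

```python
def best_matching(probs, threshold):
    """Return (score, pairs) maximizing the sum of probs[j][k] - threshold.

    Each row j and each column k is used at most once, and any point may
    stay unmatched. Exhaustive search: only for very small configurations.
    The thresholded objective is an assumption made for illustration.
    """
    n_cols = len(probs[0]) if probs else 0

    def search(j, used):
        # All rows processed: nothing more to gain.
        if j == len(probs):
            return 0.0, []
        # Option 1: leave point j unmatched.
        best_score, best_pairs = search(j + 1, used)
        # Option 2: match j to a free column k whose gain is positive.
        for k in range(n_cols):
            gain = probs[j][k] - threshold
            if k not in used and gain > 0:
                score, pairs = search(j + 1, used | {k})
                if score + gain > best_score:
                    best_score = score + gain
                    best_pairs = [(j, k)] + pairs
        return best_score, best_pairs

    return search(0, frozenset())


# Hypothetical posterior matching probabilities for 3 x 3 points.
example_probs = [
    [0.90, 0.05, 0.00],
    [0.10, 0.60, 0.20],
    [0.00, 0.30, 0.10],
]
score, pairs = best_matching(example_probs, threshold=0.5)
```

Raising the threshold makes the estimate more conservative, so fewer pairs are declared matched; lowering it admits weaker matches. For configurations of realistic size, the exhaustive search would be replaced by a linear assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the gains clipped at zero.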
