Minimum Message Length based Mixture Modelling using Bivariate von Mises Distributions with Applications to Bioinformatics

Empirically observed data are commonly modelled using mixtures of probability distributions. To model angular data, directional probability distributions such as the bivariate von Mises (BVM) are typically used. The critical task in mixture modelling is to determine the optimal number of component probability distributions. We employ the Bayesian information-theoretic principle of minimum message length (MML) to distinguish between mixture models by balancing the trade-off between a model's complexity and its goodness-of-fit to the data. We consider the problem of modelling angular data resulting from the spatial arrangement of protein structures using BVM distributions. The main contributions of the paper include the development of the mixture modelling apparatus along with the MML estimation of the parameters of the BVM distribution. We demonstrate that statistical inference using the MML framework outperforms the traditional methods and offers a mechanism to objectively determine models that are of practical significance.
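
The trade-off between complexity and goodness-of-fit can be sketched with the generic two-part message length used in MML inference. The notation below is illustrative and is not taken from the paper: $\mathcal{M}_K$ denotes a mixture with $K$ components, $D$ the observed data, and $I(\cdot)$ a message length in bits or nits.

```latex
I(\mathcal{M}_K, D) \;=\; \underbrace{I(\mathcal{M}_K)}_{\text{model complexity}}
\;+\; \underbrace{I(D \mid \mathcal{M}_K)}_{\text{goodness-of-fit}},
\qquad
\hat{K} \;=\; \arg\min_{K}\, I(\mathcal{M}_K, D)
```

Under this reading, adding a component is justified only when the reduction in the second term (the data encoded given the model) exceeds the extra cost of stating the additional parameters in the first term.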
