Modelling of directional data using Kent distributions

Modelling data on a spherical surface requires directional probability distributions. Kent distributions are often used to model asymmetrically distributed data on a three-dimensional sphere. Modelling tasks involving Kent distributions typically rely on moment estimates of the parameters, but these estimates lack a rigorous statistical treatment. This paper introduces a Bayesian estimation of the parameters of the Kent distribution, which has not previously been carried out in the literature, partly because of the distribution's complex mathematical form. We employ the Bayesian information-theoretic paradigm of Minimum Message Length (MML) to bridge this gap and derive reliable estimators. The inferred parameters are subsequently used in mixture modelling of Kent distributions, and the MML criterion is also used to infer a suitable number of mixture components. We demonstrate that the derived MML-based parameter estimates outperform the traditional estimators. We then apply the MML principle to infer mixtures of Kent distributions that model empirical data corresponding to protein conformations, and we show that Kent models describe protein structural data better than the commonly used von Mises-Fisher distributions.
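For readers unfamiliar with the Kent distribution, the sketch below evaluates its log-density on the unit sphere using the standard form from Kent (1982): the density is proportional to exp{kappa (g1.x) + beta [(g2.x)^2 - (g3.x)^2]}, where g1, g2, g3 are orthonormal axes (mean, major and minor), kappa is the concentration and beta the ovalness. It only illustrates the model being fitted and is not the MML estimator derived in the paper. The function names (`bessel_i`, `kent_log_norm`, `kent_log_pdf`) are hypothetical, the normalising constant is summed from a truncated series, and the code assumes moderate parameter values with kappa > 2 * beta >= 0.

```python
import math

def bessel_i(nu, x, terms=60):
    """Modified Bessel function of the first kind, I_nu(x), from its power series.
    Assumes x > 0 and moderate x; the series is truncated after `terms` terms."""
    total = 0.0
    for m in range(terms):
        log_term = ((2 * m + nu) * math.log(x / 2.0)
                    - math.lgamma(m + 1) - math.lgamma(m + nu + 1))
        total += math.exp(log_term)
    return total

def kent_log_norm(kappa, beta, terms=30):
    """Log of the Kent normalising constant c(kappa, beta), truncated series.
    Assumption: kappa > 2 * beta >= 0, so that the series converges quickly."""
    total = 0.0
    for j in range(terms):
        coef = math.gamma(j + 0.5) / math.gamma(j + 1)
        total += (coef * beta ** (2 * j)
                  * (kappa / 2.0) ** (-2 * j - 0.5)
                  * bessel_i(2 * j + 0.5, kappa))
    return math.log(2.0 * math.pi * total)

def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def kent_log_pdf(x, axes, kappa, beta, log_c=None):
    """Log-density at unit vector x.
    axes = (g1, g2, g3): orthonormal mean, major and minor axes.
    Pass a precomputed log_c to avoid re-summing the series on every call."""
    g1, g2, g3 = axes
    if log_c is None:
        log_c = kent_log_norm(kappa, beta)
    return (kappa * dot(g1, x)
            + beta * (dot(g2, x) ** 2 - dot(g3, x) ** 2)
            - log_c)

# Example: axes aligned with the coordinate frame, evaluated at the mean direction.
axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
log_density_at_mean = kent_log_pdf((1.0, 0.0, 0.0), axes, kappa=4.0, beta=1.0)
```

Setting beta to 0 removes the quadratic term and recovers the von Mises-Fisher distribution, which is why the Kent model can capture asymmetry that the von Mises-Fisher model cannot.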
