Exact Inference Algorithms and Their Optimization in Bayesian Clustering

Clustering is a central task in computational statistics. Its aim is to divide observed data into groups of items based on the similarity of their features. Among the various approaches to clustering, Bayesian model-based clustering has recently gained popularity, and many existing works rely on stochastic sampling methods. This work is instead concerned with exact, exponential-time algorithms for the Bayesian model-based clustering task. In particular, we consider the exact computation of two summary statistics: the number of clusters, and the pairwise incidence of items in the same cluster. We present an implemented algorithm that computes these statistics substantially faster than direct enumeration of the possible partitions would. The method is practically applicable to data sets of up to approximately 25 items.

We also apply a variant of the exact inference method to graphical models in which a given variable may have up to four parent variables. The parent variables can then have up to 16 value combinations, and the task is to cluster these combinations and find those that lead to similar conditional probability tables.

Further contributions of this work are related to number theory. We show that a novel combination of addition chains and additive bases provides the optimal arrangement of multiplications when the task is to multiply repeatedly, starting from a given number or entity, but only a certain kind of function of the successive powers is required. This arrangement speeds up the computation of the posterior distribution for the number of clusters. The same arrangement method can be applied to other multiplicative tasks, for example in matrix multiplication. We also present new algorithmic results related to finding extremal additive bases. Before this work, the extremal additive bases were known up to length 23. We have computed them up to length 24 in the unrestricted case, and up to length 41 in the restricted case.
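To give a sense of why direct enumeration of partitions becomes impractical, the following minimal sketch counts the partitions of n items, grouped by number of clusters, using the standard recurrence for Stirling numbers of the second kind. It is only an illustration of the size of the search space, not the algorithm presented in this work; the function names `stirling2_row` and `bell` are hypothetical and chosen for this example.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], where S(n, k) is the number of
    ways to partition n labelled items into exactly k non-empty clusters.

    Illustrative helper (hypothetical name), not taken from the source.
    """
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            # Recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1).
            # S(m-1, m) is zero and lies outside the previous row.
            prev_k = row[k] if k < m else 0
            new[k] = k * prev_k + row[k - 1]
        row = new
    return row


def bell(n):
    """Total number of partitions of n items (the n-th Bell number)."""
    return sum(stirling2_row(n))


if __name__ == "__main__":
    for n in (5, 10, 15, 20, 25):
        print(n, bell(n))
```

For 25 items the total is on the order of 4.6 × 10^18 partitions, which suggests why an approach faster than listing every partition is needed at that size.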
