An open source C++ implementation of multi-threaded Gaussian mixture models, k-means and expectation maximisation

Modelling of multivariate densities is a core component in many signal processing, pattern recognition and machine learning applications. The modelling is often done via Gaussian mixture models (GMMs), which use computationally expensive and potentially unstable training algorithms. We provide an overview of a fast and robust implementation of GMMs in the C++ language, employing multi-threaded versions of the Expectation Maximisation (EM) and k-means training algorithms. Multi-threading is achieved through reformulation of the EM and k-means algorithms into a MapReduce-like framework. Furthermore, the implementation uses several techniques to improve numerical stability and modelling accuracy. We demonstrate that the multi-threaded implementation achieves a speedup of an order of magnitude on a recent 16-core machine, and that it can achieve higher modelling accuracy than a previously well-established, publicly accessible implementation. The multi-threaded implementation is included as a user-friendly class in recent releases of the open source Armadillo C++ linear algebra library. The library is provided under the permissive Apache 2.0 license, allowing unencumbered use in commercial products.
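
The abstract does not spell out the MapReduce-like reformulation, so the following is only a rough sketch of the general pattern it alludes to: each thread gathers partial sufficient statistics over its own share of the data (the map step), and the per-thread accumulators are then merged (the reduce step). The sketch uses OpenMP for threading and Armadillo containers; the function name kmeans_accumulate and all internal details are hypothetical illustrations, not the library's actual code.

```cpp
#include <vector>
#include <armadillo>
#include <omp.h>   // assumes compilation with OpenMP enabled (e.g. -fopenmp)

using namespace arma;

// Hypothetical map/reduce pass for one k-means iteration:
// "map" = per-thread partial sums over a share of the data,
// "reduce" = merging the per-thread accumulators afterwards.
void kmeans_accumulate(const mat& data, const mat& means, mat& acc, vec& counts)
{
  const uword n_threads = omp_get_max_threads();

  // one private accumulator per thread avoids locking during the map step
  std::vector<mat> acc_t(n_threads, zeros<mat>(size(means)));
  std::vector<vec> counts_t(n_threads, zeros<vec>(means.n_cols));

  #pragma omp parallel for
  for (uword i = 0; i < data.n_cols; ++i)
  {
    const uword t = omp_get_thread_num();

    // map: assign vector i to its closest mean (squared Euclidean distance)
    const rowvec d = sum(square(means.each_col() - data.col(i)), 0);
    const uword  k = d.index_min();

    acc_t[t].col(k) += data.col(i);
    counts_t[t](k)  += 1;
  }

  // reduce: merge the per-thread partial statistics
  acc.zeros(size(means));
  counts.zeros(means.n_cols);

  for (uword t = 0; t < n_threads; ++t)
  {
    acc    += acc_t[t];
    counts += counts_t[t];
  }
  // updated means would then be acc.col(k) / counts(k) for each component k
}
```

The same pattern carries over to the EM accumulation step, with per-component responsibilities taking the place of the hard assignments used above.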
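As a usage illustration: in Armadillo this functionality is exposed through the gmm_diag class (with gmm_full as a full-covariance counterpart in later releases). The minimal sketch below follows the documented learn() / avg_log_p() / assign() interface; the synthetic random data and the specific parameter choices (10 Gaussians, iteration counts, variance floor) are assumptions made purely for illustration.

```cpp
#include <iostream>
#include <armadillo>

using namespace arma;

int main()
{
  // placeholder training data: 1000 random 5-dimensional vectors;
  // a real application would use extracted feature vectors instead
  mat data(5, 1000, fill::randu);

  gmm_diag model;  // GMM with diagonal covariance matrices

  // fit a 10-Gaussian model: k-means seeded from a random subset of the
  // data, 10 k-means iterations, 20 EM iterations, variance floor of 1e-10,
  // progress printed to stdout; all parameter values here are illustrative
  const bool ok = model.learn(data, 10, maha_dist, random_subset, 10, 20, 1e-10, true);

  if (ok == false)  { std::cerr << "learning failed" << std::endl; return -1; }

  // average log-likelihood of the training data under the fitted model
  std::cout << "avg log-likelihood: " << model.avg_log_p(data) << std::endl;

  // hard assignment of each vector to its most probable Gaussian
  const urowvec labels = model.assign(data, prob_dist);
  std::cout << "first vector assigned to Gaussian " << labels(0) << std::endl;

  return 0;
}
```

The seed_mode argument also accepts static_subset, static_spread and random_spread; the dist_mode argument to learn() can be eucl_dist instead of maha_dist when Mahalanobis-distance seeding is not desired.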
