High-Dimensional Cluster Analysis with the Masked EM Algorithm

Cluster analysis faces two problems in high dimensions: the “curse of dimensionality,” which can lead to overfitting and poor generalization performance, and the sheer time conventional algorithms take to process large amounts of high-dimensional data. We describe a solution to these problems, designed for spike sorting with next-generation, high-channel-count neural probes. In this application, only a small subset of features provides information about the cluster membership of any one data vector, but this informative subset differs from one data point to the next, rendering classical feature selection ineffective. We introduce a “masked EM” algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data and to real-world high-channel-count spike sorting data.
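The data structure described above, where each point is informative on only a few features and that subset varies from point to point, can be illustrated with a small synthetic example. This is a minimal sketch under stated assumptions, not the masked EM algorithm itself: the function name `make_masked_data`, the block layout of informative features, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def make_masked_data(n_points=300, n_features=60, n_clusters=6, block=10, seed=0):
    """Toy data: each cluster carries signal on its own block of features.

    Hypothetical illustration only. Every point is unit-variance noise
    everywhere except on the block of features assigned to its cluster.
    """
    rng = np.random.default_rng(seed)
    # Deterministic, balanced labels so every cluster is populated.
    labels = np.arange(n_points) % n_clusters
    # Background noise on all features.
    X = rng.normal(0.0, 1.0, size=(n_points, n_features))
    # Boolean mask: True where a feature is informative for that point.
    masks = np.zeros((n_points, n_features), dtype=bool)
    for k in range(n_clusters):
        rows = np.where(labels == k)[0]
        cols = slice(k * block, (k + 1) * block)
        X[rows, cols] += 5.0          # shift the informative block
        masks[rows, cols] = True
    return X, masks, labels

X, masks, labels = make_masked_data()
per_point = masks.sum(axis=1)        # informative features per point
union = int(masks.any(axis=0).sum()) # features informative for at least one point
print("informative features per point:", per_point.min(), "to", per_point.max())
print("features informative for some point:", union, "of", X.shape[1])
```

In this toy setup each point uses only 10 of 60 features, yet every feature is informative for some point, so a single global feature selection step cannot discard any of them. A per-point mask keeps that information while still letting most entries be treated as noise.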
