Batch discovery of recurring rare classes toward identifying anomalous samples

We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects. We model the data in each sample by an infinite mixture of Dirichlet-process Gaussian-mixture models (DPMs), with each DPM representing the noisy realization of its corresponding class distribution in a given sample. We introduce dependencies across multiple samples by placing a global Dirichlet process prior over individual DPMs. This hierarchical prior introduces a sharing mechanism across samples and allows local realizations of classes to be identified across samples. We use a collapsed Gibbs sampler for inference to recover local DPMs and identify their class associations. We demonstrate the utility of the proposed algorithm on a flow cytometry data set containing two extremely rare cell populations, and report results that significantly outperform competing techniques. The source code of the proposed algorithm is available at http://cs.iupui.edu/~dundar/aspire.htm.
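To make the setting concrete, the following is a minimal simulation sketch of the kind of data the abstract describes: classes shared across samples, each realized in a given sample with a sample-specific perturbation (the random effect). It is not the authors' implementation and performs no inference. As simplifying assumptions, it uses a fixed, finite number of classes and a single Gaussian per local class in place of the nonparametric DPM hierarchy. The function name `generate_batch` and all of its parameters are hypothetical.

```python
import numpy as np


def generate_batch(n_samples, n_points, n_classes, dim, effect_scale, rng):
    """Simulate samples that share global classes but realize them with per-sample shifts.

    Assumption: a finite number of classes with unit-variance Gaussian noise,
    used here only as a stand-in for the infinite mixture in the abstract.
    """
    # Global class means, shared by every sample in the batch.
    global_means = rng.normal(0.0, 10.0, size=(n_classes, dim))
    # Skewed class proportions, so that some classes tend to be rare.
    weights = rng.dirichlet(np.full(n_classes, 0.5))

    batch = []
    for _ in range(n_samples):
        # Random effect: each sample perturbs every class mean independently.
        shifts = rng.normal(0.0, effect_scale, size=(n_classes, dim))
        local_means = global_means + shifts
        # Class label for each point, then a noisy observation around its local mean.
        z = rng.choice(n_classes, size=n_points, p=weights)
        x = local_means[z] + rng.normal(0.0, 1.0, size=(n_points, dim))
        batch.append((x, z))
    return global_means, batch


rng = np.random.default_rng(0)
global_means, batch = generate_batch(
    n_samples=3, n_points=200, n_classes=4, dim=2, effect_scale=0.5, rng=rng
)
```

Each element of `batch` holds the points of one sample together with their class labels. A clustering method for this setting has to recover the labels and match the shifted local clusters to the same global class across samples.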
