Catalysis Clustering with GAN by Incorporating Domain Knowledge

Clustering is an important unsupervised learning method, but it faces serious challenges when data is sparse and high-dimensional. Generated clusters are often evaluated with general-purpose measures, which may not be meaningful or useful in practical applications and domains. Using a distance metric, a clustering algorithm searches the data space, groups nearby items into one cluster, and assigns distant samples to different clusters. In many real-world applications, the number of dimensions is high and the data space becomes very sparse. Selecting a suitable distance metric is difficult, and becomes even harder when categorical data is involved. Moreover, existing distance metrics are mostly generic, and clusters built on them will not necessarily make sense for domain-specific applications. One way to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach called Catalysis Clustering to incorporate domain knowledge into the clustering process. With GANs we generate catalysts, which are special synthetic points drawn from the original data distribution and verified to improve clustering quality as measured by a domain-specific metric. We then perform clustering analysis using both catalysts and real data. Final clusters are produced after the catalyst points are removed. Experiments on two challenging real-world datasets show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.
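The generate, verify, cluster, and remove loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the GAN generator is replaced here by a stand-in sampler that jitters real points, the clustering step is a plain k-means, and the domain-specific metric is a hypothetical score (the spread of a per-sample outcome across clusters). The names `sample_candidates`, `domain_score`, and `catalysis_clustering`, and all parameter values, are assumptions made for illustration only.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def domain_score(labels, outcome):
    """Hypothetical domain metric: spread of the mean outcome across clusters."""
    means = [outcome[labels == j].mean() for j in np.unique(labels)]
    return float(max(means) - min(means))

def sample_candidates(X, n, rng, noise=0.1):
    """Stand-in for a trained GAN generator: jittered copies of real points."""
    idx = rng.choice(len(X), size=n)
    return X[idx] + noise * rng.standard_normal((n, X.shape[1]))

def catalysis_clustering(X, outcome, k, n_rounds=20, batch=5, seed=0):
    rng = np.random.default_rng(seed)
    n_real = len(X)
    catalysts = np.empty((0, X.shape[1]))
    best_labels = kmeans(X, k, seed=seed)
    best_score = domain_score(best_labels, outcome)
    for _ in range(n_rounds):
        cand = sample_candidates(X, batch, rng)
        # Cluster real data together with accepted and candidate catalysts.
        trial = np.vstack([X, catalysts, cand])
        # Remove synthetic points: keep labels of the real samples only.
        labels = kmeans(trial, k, seed=seed)[:n_real]
        score = domain_score(labels, outcome)
        # Verify: keep the candidates only if the domain metric improves.
        if score > best_score:
            catalysts = np.vstack([catalysts, cand])
            best_labels, best_score = labels, score
    return best_labels, best_score, catalysts

# Toy usage on synthetic data (not one of the paper's datasets).
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
outcome = X[:, 0] + 0.1 * rng.standard_normal(60)
baseline = domain_score(kmeans(X, 3), outcome)
labels, score, catalysts = catalysis_clustering(X, outcome, k=3)
```

In this sketch a batch of candidates is accepted only when it raises the score computed on real samples, so the returned score is never below the baseline obtained without catalysts. How candidates are generated and verified in the actual method is not specified in the abstract.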
