Don’t Just Divide; Polarize and Conquer!

In data containing heterogeneous subpopulations, classification performance benefits from incorporating knowledge of the cluster structure into the classifier. Previous methods for such combined clustering and classification either 1) are classifier-specific rather than generic, or 2) perform clustering and classifier training independently, so the resulting clusters are not necessarily ones that benefit classifier performance. The question of how to cluster so as to improve the performance of classifiers trained on the clusters has received scant attention in the literature, despite its importance in several real-world applications. In this paper, we design a simple and efficient classification algorithm called Clustering Aware Classification (CAC), which finds clusters that are well suited to serve as training datasets for classifiers of each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.
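To make the setting concrete, the following is a minimal sketch of the independent "cluster, then classify" baseline described under 2) above. It is not the CAC algorithm, whose details are not given in this abstract. The choice of k-means and logistic regression from scikit-learn, the function names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def fit_cluster_then_classify(X, y, n_clusters=2, seed=0):
    """Illustrative baseline (hypothetical helper): cluster first, then
    train one classifier per cluster. Clustering ignores the labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[counts.argmax()]
    models = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        present = np.unique(y[mask])
        if present.size < 2:
            # Fewer than two classes in this cluster: store a constant label.
            models[c] = present[0] if present.size == 1 else majority
        else:
            models[c] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return km, models


def predict_cluster_then_classify(km, models, X):
    """Route each point to its nearest centroid's classifier."""
    assign = km.predict(X)
    y_pred = np.empty(len(X), dtype=int)
    for c, model in models.items():
        mask = assign == c
        if not mask.any():
            continue
        if isinstance(model, LogisticRegression):
            y_pred[mask] = model.predict(X[mask])
        else:
            y_pred[mask] = model
    return y_pred


# Synthetic data (an assumption): two well-separated subpopulations whose
# label rule depends on the second feature with opposite sign.
rng = np.random.default_rng(0)
Xa = rng.normal([-5.0, 0.0], 1.0, size=(200, 2))
Xb = rng.normal([5.0, 0.0], 1.0, size=(200, 2))
X = np.vstack([Xa, Xb])
y = np.concatenate([(Xa[:, 1] > 0).astype(int), (Xb[:, 1] < 0).astype(int)])

km, models = fit_cluster_then_classify(X, y, n_clusters=2)
acc = float(np.mean(predict_cluster_then_classify(km, models, X) == y))
global_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
print(f"per-cluster accuracy: {acc:.3f}, single global model: {global_acc:.3f}")
```

In this toy case the clusters happen to coincide with the subpopulations, so per-cluster classifiers do well. Because the clustering step never sees the labels, nothing guarantees such alignment in general, which is the gap the abstract says CAC is designed to address.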
