Metric Learning from Imbalanced Data with Generalization Guarantees

Abstract

Since many machine learning algorithms require a distance metric to capture dis/similarities between data points, metric learning has received much attention during the past decade. Surprisingly, very few methods have focused on learning a metric in an imbalanced scenario, where the number of positive examples is much smaller than the number of negative examples, and even fewer have derived theoretical guarantees in this setting. Here, we address this difficult task and design a new Mahalanobis metric learning algorithm (IML) that deals with class imbalance. Using the uniform stability framework, we further prove a generalization bound that involves the proportion of positive examples. An empirical study performed on a wide range of datasets shows the efficiency of IML.
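For readers unfamiliar with the term, a Mahalanobis metric is parameterized by a positive semidefinite matrix M and measures d_M(x, x') = sqrt((x - x')^T M (x - x')). The sketch below only evaluates such a distance for a fixed M; it is not the IML algorithm, and the function name and the example matrix are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np


def mahalanobis_distance(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ M @ diff))


# Hypothetical example: any matrix of the form L^T L is positive semidefinite.
L = np.array([[2.0, 0.0],
              [0.0, 1.0]])
M = L.T @ L

x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])

d_learned = mahalanobis_distance(x, y, M)            # first axis is stretched by 2
d_euclid = mahalanobis_distance(x, y, np.eye(2))     # M = I gives the Euclidean distance
```

With M equal to the identity the distance reduces to the Euclidean one, so learning M amounts to learning a linear rescaling of the input space in which distances are then computed.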
