Matrix sketching for supervised classification with imbalanced classes

Matrix sketching is a recently developed data compression technique. An input matrix A is efficiently approximated by a smaller matrix B that preserves most of the properties of A up to a guaranteed approximation ratio, so that numerical operations on big data sets become faster. Sketching algorithms generally compress the original data set through random projections, and this stochastic generation process makes them amenable to statistical analysis. Their statistical properties have been widely studied in the context of multiple linear regression. In this paper we propose matrix sketching as a tool for rebalancing class sizes in supervised classification with imbalanced classes. Class imbalance is well known to lead to poor classification performance, especially for the minority class.
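
As a rough illustration of the compression step described above, the following sketch reduces a majority-class data matrix to the size of the minority class with a Gaussian random projection. It is a minimal example under stated assumptions, not the method of the paper: the choice of a Gaussian sketch, the function name `gaussian_sketch`, the simulated data, and the class sizes are all hypothetical.

```python
import numpy as np


def gaussian_sketch(A, k, rng):
    """Return B = S @ A, where S is k x n with i.i.d. N(0, 1/k) entries.

    With this scaling E[S.T @ S] is the identity, so B.T @ B matches
    A.T @ A in expectation. (Assumed sketch type; others exist.)
    """
    n = A.shape[0]
    S = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(k, n))
    return S @ A


# Hypothetical imbalanced problem: 2000 majority rows, 500 minority rows.
rng = np.random.default_rng(0)
n_major, n_minor, d = 2000, 500, 3
A_major = rng.normal(size=(n_major, d))
A_minor = rng.normal(loc=1.0, size=(n_minor, d))

# Compress the majority class to as many rows as the minority class.
# The rows of B_major are random linear combinations of the original
# rows, not a subsample of them.
B_major = gaussian_sketch(A_major, n_minor, rng)

# Compare the Gram matrices of the original and sketched data.
gram_A = A_major.T @ A_major
gram_B = B_major.T @ B_major
rel_err = np.linalg.norm(gram_B - gram_A) / np.linalg.norm(gram_A)
```

Here `rel_err` is the relative Frobenius error between the two Gram matrices; it shrinks as the number of sketched rows `k` grows. How class means and labels are treated after sketching is not specified in the abstract, so that step is left out.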
