Kernel Matrix Approximation on Class-Imbalanced Data With an Application to Scientific Simulation

Low-rank approximations of the kernel matrices that arise in nonlinear machine learning techniques can significantly reduce memory and computational costs. A compelling approach is to find a concise set of exemplars, or landmarks, which reduces the number of similarity measure evaluations from quadratic to linear in the data size. A key challenge, however, is to regulate the tradeoff between the quality of the landmarks and resource consumption. Despite the volume of research in this area, little is known about how landmark selection techniques perform on class-imbalanced data sets, which are becoming increasingly prevalent in many applications. This paper therefore provides a comprehensive empirical investigation using several real-world imbalanced data sets, including scientific data, evaluating the quality of approximate low-rank decompositions and examining their influence on the accuracy of downstream tasks. Furthermore, we present a new landmark selection technique called Distance-based Importance Sampling and Clustering (DISC), which computes relative importance scores to improve accuracy-efficiency tradeoffs compared to existing methods, which range from probabilistic sampling to clustering. The proposed method follows a coarse-to-fine strategy to capture the intrinsic structure of complex data sets, allowing us to substantially reduce the computational complexity and memory footprint with minimal loss in accuracy.
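
To make the landmark idea concrete, the following is a minimal sketch of a Nyström-style low-rank kernel approximation. It is not the DISC method: landmarks are drawn uniformly at random as a stand-in, and the Gaussian kernel, the `gamma` value, the synthetic data, and the function names `rbf_kernel` and `nystrom_factors` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq_dists = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def nystrom_factors(X, landmarks, gamma=0.5):
    """Return C (n x m) and pinv(W) (m x m) so that K is approx. C @ pinv(W) @ C.T."""
    C = rbf_kernel(X, landmarks, gamma)          # n * m kernel evaluations
    W = rbf_kernel(landmarks, landmarks, gamma)  # m * m kernel evaluations
    # rcond discards near-zero singular values of W for numerical stability.
    return C, np.linalg.pinv(W, rcond=1e-10)

# Synthetic data (assumption): n = 200 points in 2-D, m = 20 landmarks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
idx = rng.choice(X.shape[0], size=20, replace=False)  # uniform sampling, not DISC

C, W_pinv = nystrom_factors(X, X[idx])
K_approx = C @ W_pinv @ C.T

# The full kernel matrix is formed here only to measure the error.
K = rbf_kernel(X, X)
rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"relative Frobenius error: {rel_err:.3e}")
```

In practice only the factors `C` and `W_pinv` would be stored, which takes memory proportional to n·m rather than n²; the choice of landmarks determines how small the approximation error is for a given m.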
