Paper Information - Group Learning for High-Dimensional Sparse Data

Group Learning for High-Dimensional Sparse Data

We describe a new methodology for supervised learning with sparse data, i.e., when the number of input features (d) is much larger than the number of training samples (n). Under the proposed approach, all d available input features are split into t subsets, effectively yielding a larger number (t*n) of labeled training samples in a lower-dimensional input space (of dimensionality d/t). This modified training data is then used to estimate a classifier that makes predictions in the lower-dimensional space. In this paper, a standard SVM is used to train the classifier. During testing (prediction), the group of t predictions made by the SVM classifier must be combined via intelligent post-processing rules in order to make a prediction for a test input in the original d-dimensional space. The novelty of our approach lies in the design and empirical validation of these post-processing rules under the Group Learning setting. We demonstrate that such post-processing rules effectively reflect general (common-sense) a priori knowledge about the application data. Specifically, we propose two different post-processing schemes and demonstrate their effectiveness for two real-life application domains: handwritten digit recognition and seizure prediction from iEEG signals. These empirical results show superior performance of the Group Learning approach for sparse data, under both balanced and unbalanced classification settings.
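The data transformation described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than the paper's method: the features are split into contiguous, equal-sized chunks (the abstract does not say how the t subsets are formed), the classifier is passed in as an arbitrary callable instead of a trained SVM, and the t predictions are combined by a plain majority vote, which merely stands in for the two post-processing schemes that the abstract mentions but does not specify. The function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Vector = Sequence[float]


def split_features(x: Vector, t: int) -> List[List[float]]:
    """Split one d-dimensional input into t contiguous chunks of size d/t.

    Contiguous chunking is an assumption made for illustration only.
    """
    d = len(x)
    if t <= 0 or d % t != 0:
        raise ValueError("t must be positive and must divide the feature count d")
    size = d // t
    return [list(x[i * size:(i + 1) * size]) for i in range(t)]


def make_group_training_set(
    X: Sequence[Vector], y: Sequence[Hashable], t: int
) -> Tuple[List[List[float]], List[Hashable]]:
    """Turn n samples of dimension d into t*n samples of dimension d/t.

    Every chunk inherits the label of the sample it was cut from.
    """
    X_small: List[List[float]] = []
    y_small: List[Hashable] = []
    for x, label in zip(X, y):
        for chunk in split_features(x, t):
            X_small.append(chunk)
            y_small.append(label)
    return X_small, y_small


def predict_by_majority_vote(
    classify: Callable[[List[float]], Hashable], x: Vector, t: int
) -> Hashable:
    """Classify each of the t chunks of x and combine the t predictions.

    Majority voting is an assumed placeholder rule, not one of the paper's
    post-processing schemes. Ties go to the label that was predicted first.
    """
    votes = [classify(chunk) for chunk in split_features(x, t)]
    return Counter(votes).most_common(1)[0][0]
```

In use, `make_group_training_set` would produce the t*n low-dimensional samples on which a classifier such as an SVM is trained, and the trained model's per-chunk prediction function would be passed as `classify` to `predict_by_majority_vote`. Keeping the classifier behind a callable makes the sketch independent of any particular SVM library.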
