Review of classical dimensionality reduction and sample selection methods for large-scale data processing

Abstract In the era of big data, data sets with growing numbers of samples and high-dimensional attributes play an important role in fields such as data mining, pattern recognition, and machine learning. Meanwhile, machine learning algorithms are being applied effectively to large-scale data processing. This paper reviews the classical dimensionality reduction and sample selection methods based on machine learning algorithms for large-scale data processing. First, it gives a brief overview of the classical sample selection and dimensionality reduction methods. Then, it examines the applications of those methods and their combinations with classical machine learning methods, such as clustering, random forest, fuzzy sets, and heuristic algorithms, and particularly with deep learning methods. Furthermore, the paper introduces application frameworks that combine sample selection and dimensionality reduction in two ways, sequentially and simultaneously, almost all of which achieve good results on large-scale training data compared with the original models. Lastly, we conclude that sample selection and dimensionality reduction methods are essential and effective for modern large-scale data processing. In future work, machine learning algorithms, especially deep learning methods, will play a more important role in the processing of large-scale data.
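To make the "sequential" combination concrete, the following is a minimal sketch and not a method taken from the paper. It assumes stratified random subsampling as a stand-in for sample selection and variance ranking as a stand-in for dimensionality reduction; the function names `select_samples`, `select_features`, and `sequential_reduce` are hypothetical and chosen only for illustration.

```python
import random
from statistics import pvariance


def select_samples(y, per_class, rng):
    """Stratified random subsampling: keep at most `per_class` row indices per label."""
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    keep = []
    for label in sorted(by_label):
        idx = by_label[label]
        keep.extend(rng.sample(idx, min(per_class, len(idx))))
    return sorted(keep)


def select_features(X, k):
    """Return the indices of the k columns with the largest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def sequential_reduce(X, y, per_class, k, seed=0):
    """Sequential scheme: reduce the samples first, then reduce the features."""
    rng = random.Random(seed)
    rows = select_samples(y, per_class, rng)
    X_small = [X[i] for i in rows]          # features are ranked on the subsample only
    cols = select_features(X_small, k)
    reduced = [[X[i][j] for j in cols] for i in rows]
    return rows, cols, reduced


# Toy data (assumed for illustration): 200 samples, 6 features,
# where features 1 and 4 have a much wider spread than the rest.
data_rng = random.Random(1)
scales = [0.1, 5.0, 0.1, 0.1, 3.0, 0.1]
X = [[data_rng.gauss(0, s) for s in scales] for _ in range(200)]
y = [i % 2 for i in range(200)]

rows, cols, reduced = sequential_reduce(X, y, per_class=20, k=2)
print(len(rows), cols, len(reduced[0]))
```

In this sketch the feature ranking is computed on the selected samples only, which is what makes the scheme sequential; a simultaneous scheme would instead choose samples and features within a single joint optimization.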
