High‐dimensional spectral data classification with nonparametric feature screening

Two nonparametric feature screening methods, the Kolmogorov filter and model-free feature screening, measure the marginal relationship between a categorical response and each predictor variable without parametric assumptions, and can therefore select important variables in high‐dimensional classification data. Random forest, a classical nonparametric method, can solve a wide range of classification problems. In this paper, we combine the two screening methods with random forest to classify spectral data, and we compare the resulting classifiers with conventional classification methods on three spectral datasets. The comparison shows that our methods achieve better classification performance and variable selection than the other methods.
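To make the screening step concrete, the following is a minimal sketch of Kolmogorov-filter-style screening for a binary response. It assumes the usual form of the filter: each feature is scored by the two-sample Kolmogorov-Smirnov statistic between its class-conditional empirical distributions, and the top-scoring features are kept. The function names (`ks_statistic`, `kolmogorov_filter`), the parameter `n_keep`, and the toy data are illustrative assumptions, not taken from the paper. The model-free screening method and the random forest classifier are not shown; the selected columns would then be passed to whichever classifier is used.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The supremum over x is attained at one of the observed values.
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def kolmogorov_filter(X, y, n_keep):
    """Score every feature of a binary problem by its KS statistic.

    Returns the indices of the n_keep highest-scoring features and
    the full list of scores. (Hypothetical helper for illustration.)
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles binary labels only")
    scores = []
    for j in range(len(X[0])):
        col0 = [row[j] for row, label in zip(X, y) if label == classes[0]]
        col1 = [row[j] for row, label in zip(X, y) if label == classes[1]]
        scores.append(ks_statistic(col0, col1))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep], scores


# Toy data (assumed for illustration): 60 samples, 50 noise features,
# with features 3 and 17 shifted in class 1 so that they are informative.
rng = random.Random(0)
n, p = 60, 50
y = [i % 2 for i in range(n)]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for row, label in zip(X, y):
    if label == 1:
        row[3] += 4.0
        row[17] += 4.0

selected, scores = kolmogorov_filter(X, y, n_keep=5)
print(selected)
```

Because the score depends only on the ranks of the values within each feature, it is unchanged by monotone transformations of a feature, which is one reason this kind of screening needs no parametric model.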
