A kernel distance-based representative subset selection method

This paper proposes a new representative subset selection method, kernel distance-based sample set partitioning based on joint x–y distances (KSPXY). The method modifies the original sample set partitioning based on joint x–y distances (SPXY) algorithm by replacing the Euclidean distance with the kernel distance. KSPXY is used with partial least-squares (PLS) regression to predict three chemical quality characteristics of diesel fuel. We compare the PLS–KSPXY modelling strategy with two alternatives that pair PLS with the SPXY and Kennard–Stone (KS) algorithms. Judged by the root mean-squared error of prediction, KSPXY improves the predictive ability of the PLS model more than either SPXY or KS, and the difference between PLS–KSPXY and the other two modelling strategies is statistically significant. The paper also provides the authors' MATLAB code for the KSPXY algorithm.
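To make the idea concrete, here is a minimal Python sketch of an SPXY-style selection in which a kernel-induced distance stands in for the Euclidean distance. It is not the authors' MATLAB code, and several details are assumptions that the abstract does not state: the Gaussian kernel, the bandwidths `sigma_x` and `sigma_y`, the use of the kernel distance for y as well as x, and the function names themselves. The joint distance is the usual SPXY sum of the x and y distances, each divided by its maximum, and samples are picked by the Kennard–Stone max-min rule.

```python
import math


def gaussian_kernel_distance(a, b, sigma=1.0):
    """Distance induced by a Gaussian kernel: sqrt(k(a,a) + k(b,b) - 2 k(a,b)).

    For a Gaussian kernel k(a,a) = k(b,b) = 1, so this is sqrt(2 - 2 k(a,b)).
    The kernel choice and sigma are assumptions made for this sketch.
    """
    sq_euclid = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    k_ab = math.exp(-sq_euclid / (2.0 * sigma ** 2))
    return math.sqrt(max(0.0, 2.0 - 2.0 * k_ab))


def pairwise_distances(rows, sigma):
    """Symmetric matrix of kernel distances between all pairs of rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = gaussian_kernel_distance(rows[i], rows[j], sigma)
    return d


def scale_by_max(d):
    """Divide every entry by the largest entry, as in SPXY's normalisation."""
    largest = max(max(row) for row in d)
    if largest == 0.0:
        return [[0.0 for _ in row] for row in d]
    return [[v / largest for v in row] for row in d]


def kspxy_select(X, y, n_select, sigma_x=1.0, sigma_y=1.0):
    """Return indices of n_select samples chosen on joint x-y kernel distances."""
    n = len(X)
    if len(y) != n:
        raise ValueError("X and y must have the same number of samples")
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    dx = scale_by_max(pairwise_distances(X, sigma_x))
    dy = scale_by_max(pairwise_distances([[v] for v in y], sigma_y))
    d = [[dx[i][j] + dy[i][j] for j in range(n)] for i in range(n)]

    # Kennard-Stone start: the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(d[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


if __name__ == "__main__":
    X = [[0.0], [0.1], [0.2], [5.0], [10.0]]
    y = [0.0, 0.1, 0.2, 5.0, 10.0]
    print(kspxy_select(X, y, 3, sigma_x=5.0, sigma_y=5.0))
```

The selected indices would form the calibration set and the rest the validation set. One design point worth noting: a Gaussian kernel distance saturates near sqrt(2) for samples much farther apart than the bandwidth, so in this sketch the choice of `sigma_x` and `sigma_y` determines how strongly distant samples are distinguished from one another.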
