HSPXY: A hybrid‐correlation and diversity‐distances based data partition method

A representative dataset is crucial for building a robust and generalized machine learning model, especially for small databases. Correlation is not usually considered in distance‐based set partition methods; therefore, distant yet correlated samples might be incorrectly assigned. An improved sample subset partition method based on joint hybrid correlation and diversity x‐y distances (HSPXY) is proposed within the framework of sample set partitioning based on joint x‐y distances (SPXY). A hybrid distance combining the cosine angle distance and the Euclidean distance in variable space incorporates sample correlation into the distance‐based partition, with the two distances representing the correlation and the diversity of samples, respectively. To compare against existing partition methods, partial least squares (PLS) regression models were built on four set partition methods: random sampling (RS), Kennard‐Stone (KS), SPXY, and HSPXY. In applications to small chemical databases, models based on the proposed HSPXY algorithm achieved smaller root mean square errors and higher coefficients of determination than the other tested partition methods, indicating that the training set is well represented. This suggests the proposed algorithm provides a new option for obtaining a representative calibration set and may serve as an alternative to existing data partition methods.
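The core idea above, a joint x‐y distance in which the x-space term blends Euclidean (diversity) and cosine angle (correlation) distances, can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the equal weighting of the two distance terms, the use of a plain absolute difference for the 1-D response y, and the function names are assumptions for demonstration.

```python
import numpy as np

def hybrid_distance(A):
    """Pairwise hybrid distance over rows of A: normalized Euclidean
    (diversity) plus normalized cosine angle distance (correlation).
    Equal weighting of the two terms is an assumption."""
    # pairwise Euclidean distances
    diff = A[:, None, :] - A[None, :, :]
    d_euc = np.sqrt((diff ** 2).sum(axis=2))
    # pairwise cosine angle distances: 1 - cosine similarity
    norms = np.linalg.norm(A, axis=1)
    cos_sim = (A @ A.T) / np.outer(norms, norms)
    d_cos = 1.0 - np.clip(cos_sim, -1.0, 1.0)
    # scale each term to [0, 1] before combining, as in SPXY-style distances
    return d_euc / d_euc.max() + d_cos / d_cos.max()

def joint_xy_distance(X, y):
    """SPXY-style joint distance: normalized hybrid x-distance plus
    normalized y-distance (absolute difference for a scalar response)."""
    dx = hybrid_distance(X)
    dy = np.abs(y[:, None] - y[None, :])
    return dx / dx.max() + dy / dy.max()
```

The resulting matrix can then feed a KS-style greedy selection (pick the two most distant samples, then repeatedly add the sample farthest from the current calibration set) to build the training/validation split.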
