Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANNs, and k-NN

Abstract Three types of chemotherapeutic agents (antibacterials, antineoplastics, and antifungals) registered in the MDL Drug Data Report (MDDR) database were used as the training set, and a classification study was performed with seven methods: principal component analysis–linear discriminant analysis (PCA-LDA), soft independent modeling by class analogy (SIMCA), partial least-squares 2 (PLS2), artificial neural networks (ANNs), the nearest neighbor method (NN), a combination of Ward clustering and NN (W-NN), and a combination of genetic algorithms (GAs) and NN (GA-NN). The number of correctly classified samples decreased in the following order: NN, ANNs, GA-NN, SIMCA, PLS2, W-NN, and PCA-LDA. These models were then used to predict a test set consisting of drugs registered in the Comprehensive Medicinal Chemistry (CMC) database. The number of correctly predicted samples decreased in the following order: NN, GA-NN, W-NN, SIMCA, PCA-LDA, ANNs, and PLS2. NN gave the best model in terms of both classification and prediction, whereas overfitting was observed for ANNs and PLS2. Although the fit and predictive ability of GA-NN and W-NN were inferior to those of NN, the predictive ability of both methods was superior to that of PCA-LDA, SIMCA, ANNs, and PLS2.
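The comparison above rests on counting correct assignments twice: once on the training set (fit) and once on an independent test set (predictive ability). The following is a minimal sketch of that bookkeeping for a nearest neighbor classifier. It is not the authors' implementation: the two-descriptor toy data are invented for illustration and do not come from MDDR or CMC, and the Euclidean distance and the leave-one-out treatment of the training set are assumptions, since the abstract does not state the descriptors, the metric, or the validation scheme.

```python
import math


def nearest_label(query, samples, skip=None):
    """Return the label of the sample closest to `query`.

    Euclidean distance is an assumption; the abstract does not name a metric.
    `skip` is the index of a sample to ignore (used for leave-one-out).
    """
    best_label, best_dist = None, math.inf
    for i, (features, label) in enumerate(samples):
        if i == skip:
            continue
        dist = math.dist(query, features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def count_correct_loo(train):
    """Count training samples classified correctly, leaving each one out in turn."""
    return sum(
        nearest_label(x, train, skip=i) == y for i, (x, y) in enumerate(train)
    )


def count_correct(train, test):
    """Count test samples whose nearest training neighbor has the right label."""
    return sum(nearest_label(x, train) == y for x, y in test)


# Hypothetical toy data: (descriptor vector, class label).
train = [
    ((0.0, 0.0), "antibacterial"), ((0.2, 0.1), "antibacterial"),
    ((5.0, 5.0), "antineoplastic"), ((5.1, 4.8), "antineoplastic"),
    ((0.0, 5.0), "antifungal"), ((0.3, 5.2), "antifungal"),
]
test = [
    ((0.1, 0.3), "antibacterial"),
    ((4.7, 5.2), "antineoplastic"),
    ((0.2, 4.6), "antifungal"),
    ((2.6, 2.4), "antifungal"),  # lies between clusters, so it is misassigned
]

train_correct = count_correct_loo(train)
test_correct = count_correct(train, test)
print(f"training set: {train_correct}/{len(train)} correct")
print(f"test set:     {test_correct}/{len(test)} correct")
```

A model whose training count is high but whose test count drops sharply shows the pattern the abstract describes as overfitting for ANNs and PLS2; running the same two counts for each method gives rankings of the kind reported above.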
