Toward an Optimal Procedure for PC-ANN Model Building: Prediction of the Carcinogenic Activity of a Large Set of Drugs

The performance of three novel QSAR algorithms is compared by applying them to the prediction of the carcinogenic activity of a large, structurally diverse set of 735 drugs. Each algorithm combines principal component-artificial neural network (PC-ANN) modeling with a different factor selection procedure: eigenvalue ranking (ER-PC-ANN), correlation ranking (CR-PC-ANN), or a genetic algorithm (PC-GA-ANN). A total of 1350 theoretical descriptors are calculated for each molecule, and the resulting 735 x 1350 descriptor matrix is subjected to principal component analysis (PCA). The first 137 principal components (PCs) explain 95% of the variance in the matrix. From this pool of 137 PCs, the factor selection methods (ER, CR, and GA) are used to select the best set of PCs for PC-ANN modeling. In ER-PC-ANN, the PCs are entered into the ANN successively in order of decreasing eigenvalue. In CR-PC-ANN, an ANN is first used to model the nonlinear relationship between each PC and the carcinogenic activity separately; the PCs are then ranked by decreasing correlating ability and entered into the input layer of the network one after another. In PC-GA-ANN, a search algorithm (the genetic algorithm) is used to find the best set of PCs. Both external validation and cross-validation are used to assess the performance of the resulting models. The results obtained by the PC-GA-ANN and CR-PC-ANN procedures are superior to those obtained by ER-PC-ANN. The PC-GA-ANN algorithm gives better results than CR-PC-ANN, but the difference is not significant.
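The difference between eigenvalue ranking and correlation ranking can be illustrated with a minimal sketch. It is not the authors' implementation, and several things in it are assumptions made only for illustration: the data are small and synthetic rather than the 735 x 1350 descriptor matrix, the absolute Pearson correlation stands in for the single-input ANN used to score each PC in CR-PC-ANN, and a linear least-squares fit stands in for the ANN when PCs are entered one after another. The function names (`pca_scores`, `eigenvalue_ranking`, `correlation_ranking`, `stepwise_entry`) are hypothetical. The genetic algorithm search is not sketched here.

```python
import numpy as np

def pca_scores(X, var_threshold=0.95):
    """Return scores and eigenvalues of the leading PCs that together
    explain at least `var_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)                        # mean-center each descriptor
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (X.shape[0] - 1)              # covariance eigenvalues, descending
    cum = np.cumsum(eigvals) / eigvals.sum()       # cumulative explained variance
    k = int(np.searchsorted(cum, var_threshold) + 1)
    return U[:, :k] * s[:k], eigvals[:k]

def eigenvalue_ranking(eigvals):
    """ER: order PCs by decreasing eigenvalue."""
    return np.argsort(eigvals)[::-1]

def correlation_ranking(scores, y):
    """CR stand-in (assumption): order PCs by decreasing |Pearson r| with the
    activity, instead of the single-input ANN described in the text."""
    r = np.array([abs(np.corrcoef(scores[:, j], y)[0, 1])
                  for j in range(scores.shape[1])])
    return np.argsort(r)[::-1]

def stepwise_entry(scores, y, order, n_train):
    """Enter PCs one at a time in the given order and return the hold-out
    RMSE after each step. A linear least-squares model replaces the ANN
    here (assumption)."""
    rmse = []
    for m in range(1, len(order) + 1):
        A = np.column_stack([np.ones(len(y)), scores[:, order[:m]]])
        coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
        resid = y[n_train:] - A[n_train:] @ coef
        rmse.append(float(np.sqrt(np.mean(resid**2))))
    return rmse

# Synthetic demonstration: 60 "molecules", 20 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
scores, eigvals = pca_scores(X)

# Hypothetical activity driven by a PC that does not have the largest eigenvalue.
y = 2.0 * scores[:, 5] + 0.1 * rng.normal(size=60)

er = eigenvalue_ranking(eigvals)
cr = correlation_ranking(scores, y)
rmse_er = stepwise_entry(scores, y, er, n_train=40)
rmse_cr = stepwise_entry(scores, y, cr, n_train=40)

print("retained PCs:", scores.shape[1])
print("first PC entered by ER:", er[0], " by CR:", cr[0])
print("hold-out RMSE after one PC, ER vs CR:", rmse_er[0], rmse_cr[0])
```

In this toy setting the informative PC is entered first under correlation ranking but only later under eigenvalue ranking, which shows why ordering PCs by their relation to the activity can give a more compact model than ordering them by explained variance alone.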