A Comparison of Dimensionality Reduction Techniques in Virtual Screening

Screening methods have long struggled with the high dimensionality of the data in virtual screening tasks. One of the most commonly used techniques for reducing high-dimensional data is principal component analysis (PCA). PCA and its variants have been introduced and reintroduced to address the problems of particular real-world applications. In this paper, PCA and four of its variants are compared and analyzed on a virtual screening task that uses fingerprint representations, which are among the most commonly used descriptors in virtual screening. These methods have not been compared and studied together on high-dimensional, binary-valued data elsewhere. The results show that the PCA variants outperform PCA on the most heterogeneous classes, while they are competitive with PCA on the homogeneous classes. Supervised PCA is found to be the best technique and is competitive with Fisher discriminant analysis. It should be noted that Fisher discriminant analysis uses all of the provided information, whereas Supervised PCA uses only a few components.
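To make the contrast between PCA and a supervised variant concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes synthetic binary fingerprints and one common formulation of supervised PCA, in which bits are first screened by their correlation with the activity label and PCA is then run on the retained bits. The function names `pca_project` and `supervised_pca_project`, the fingerprint length, and all thresholds are hypothetical choices made for illustration.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each bit column
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (s ** 2) / np.sum(s ** 2)
    return Xc @ components.T, components, explained_ratio[:n_components]


def supervised_pca_project(X, y, n_features, n_components):
    """Keep the bits most correlated with y, then run PCA on those bits only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant bits have zero variance; score them 0 instead of dividing by zero.
    score = np.zeros(X.shape[1])
    nonzero = denom > 0
    score[nonzero] = np.abs(Xc[:, nonzero].T @ yc) / denom[nonzero]
    keep = np.argsort(score)[::-1][:n_features]
    Z, _, _ = pca_project(X[:, keep], n_components)
    return Z, keep


# Synthetic stand-in for fingerprints: 200 molecules, 1024 bits (assumed sizes).
rng = np.random.default_rng(0)
n_molecules, n_bits, n_informative = 200, 1024, 20
y = rng.integers(0, 2, size=n_molecules)  # 1 = active, 0 = inactive
X = (rng.random((n_molecules, n_bits)) < 0.1).astype(int)
# Make the first 20 bits informative: set far more often in active molecules.
p_on = np.where(y[:, None] == 1, 0.8, 0.1)
X[:, :n_informative] = (rng.random((n_molecules, n_informative)) < p_on).astype(int)

Z_pca, components, ratio = pca_project(X, n_components=5)
Z_spca, kept_bits = supervised_pca_project(X, y, n_features=20, n_components=5)

print("PCA embedding shape:", Z_pca.shape)
print("Supervised PCA embedding shape:", Z_spca.shape)
print("Informative bits retained:", len(set(kept_bits.tolist()) & set(range(n_informative))))
```

In this sketch, plain PCA chooses directions of maximum variance without reference to the labels, whereas the supervised variant uses the labels only to decide which bits enter the decomposition. Other formulations of supervised PCA exist and may differ from the one compared in the paper.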
