CerBeruS: A System Supporting the Sequential Screening Process

This paper describes the general design and application of CerBeruS, a computer-based system for supporting the process of sequential screening. CerBeruS stands for cluster-based selection, and cluster analysis forms the pivotal part of the system. CerBeruS uses Ward's clustering method to partition the data set to be screened into smaller, more homogeneous subsets. One representative is picked from each subset and suggested as a screening candidate. Although the number of compounds submitted to screening is most often driven by the capacity of the assay, CerBeruS provides a statistical measure for estimating the optimal number of clusters in the data set. This measure forms a point of reference for all screening experiments. Different hierarchies of subsets are stored in an Oracle database. Information about the size and content of a cluster can be retrieved from this database via a Visual Basic application. How these components work together in the CerBeruS system is demonstrated on a large data set. In addition, we show that the statistical measure can be used to find an optimal trade-off between screening effort and number of hits.
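The cluster-then-pick-one idea can be illustrated with a minimal Python sketch. It is not the CerBeruS implementation: the abstract does not state which descriptors are used or how the representative of a subset is chosen, so the two-dimensional descriptor tuples, the function names, and the "member closest to the centroid" rule below are all assumptions made for illustration. The number of clusters k is simply supplied by the caller, as it would be when assay capacity dictates it; the statistical measure for the optimal number of clusters is not reproduced here.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length descriptor tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    return len(a) * len(b) / (len(a) + len(b)) * dist(centroid(a), centroid(b)) ** 2


def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward's criterion, stopped at k clusters.

    Every step scans all cluster pairs, so this is only suitable for tiny examples.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge the pair whose union adds the least within-cluster variance.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pick_representative(cluster):
    """Assumed selection rule: the member closest to the cluster centroid."""
    c = centroid(cluster)
    return min(cluster, key=lambda p: dist(p, c))


# Hypothetical descriptor vectors for six compounds.
descriptors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
    (5.0, 5.0), (5.1, 5.0),
    (9.0, 0.0),
]

clusters = ward_clusters(descriptors, k=3)
representatives = [pick_representative(c) for c in clusters]
```

Here `representatives` holds one suggested screening candidate per subset. For data sets of realistic size, an optimized library routine for Ward linkage would be the practical choice over this quadratic pair scan.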
