Comparison of the NCI Open Database with Seven Large Chemical Structural Databases

Eight large chemical databases have been analyzed and compared with one another. Central to this comparison is the open National Cancer Institute (NCI) database, consisting of approximately 250 000 structures. The other databases analyzed are the Available Chemicals Directory ("ACD," from MDL, release 1.99, 3D-version); the ChemACX ("ACX," from CamSoft, Version 4.5); the Maybridge Catalog and the Asinex database (both as distributed by CamSoft as part of ChemInfo 4.5); the Sigma-Aldrich Catalog (CD-ROM, 1999 Version); the World Drug Index ("WDI," Derwent, version 1999.03); and the organic part of the Cambridge Crystallographic Database ("CSD," from Cambridge Crystallographic Data Center, 1999 Version 5.18). The database properties analyzed include internal duplication rates; compounds unique to each database; cumulative occurrence of compounds in an increasing number of databases; overlap of identical compounds between two databases; similarity overlap; and diversity, among others. The crystallographic database CSD and the WDI show somewhat less overlap with the other databases than those databases show with each other. In particular, the collections of commercial compounds and the compilations of vendor catalogs overlap substantially with one another. Still, no database is completely a subset of any other, and each appears to have its own niche and thus its own "raison d'être." The NCI database has by far the highest number of compounds unique to it: approximately 200 000 of the NCI structures were not found in any of the other analyzed databases.
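The identity-based properties listed above (internal duplication, unique compounds, occurrence across several databases, pairwise overlap) can be illustrated with plain set operations. The following is a minimal sketch, not the procedure used in the study: it assumes every structure has already been reduced to a canonical identifier string (the abstract does not say how structures were matched), and the function names and toy identifiers `c1` to `c6` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def internal_duplication_rate(records):
    """Fraction of records that repeat an identifier already in the same database."""
    if not records:
        return 0.0
    return (len(records) - len(set(records))) / len(records)


def _membership_counts(dbs):
    """Number of databases each distinct identifier appears in."""
    return Counter(cid for ids in dbs.values() for cid in set(ids))


def unique_to_each(dbs):
    """Identifiers found in exactly one database, grouped by that database."""
    counts = _membership_counts(dbs)
    return {name: {cid for cid in set(ids) if counts[cid] == 1}
            for name, ids in dbs.items()}


def occurrence_histogram(dbs):
    """Map k to the number of distinct identifiers present in exactly k databases."""
    return Counter(_membership_counts(dbs).values())


def pairwise_overlap(dbs):
    """Count of identical identifiers shared by each pair of databases."""
    return {(a, b): len(set(dbs[a]) & set(dbs[b]))
            for a, b in combinations(sorted(dbs), 2)}


# Toy data (hypothetical identifiers, assumed to be canonical per structure).
toy = {
    "NCI": ["c1", "c2", "c3", "c3", "c4"],
    "ACD": ["c2", "c4", "c5"],
    "WDI": ["c4", "c6"],
}

if __name__ == "__main__":
    print(internal_duplication_rate(toy["NCI"]))
    print(unique_to_each(toy))
    print(dict(occurrence_histogram(toy)))
    print(pairwise_overlap(toy))
```

Summing the histogram from k upward gives the number of compounds present in at least k databases. Similarity overlap and diversity need a structural descriptor and a similarity measure, so they are outside the scope of this sketch.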
