CFam: a chemical families database based on iterative selection of functional seeds and seed-directed compound clustering

Similarity-based clustering and classification of compounds support the search for drug leads and structural and chemogenomic studies, with applications in chemical, biomedical, agricultural, material and other industrial fields. A database that organizes compounds into similarity-based, scaffold-based and property-based families is useful for these tasks. The CFam chemical family database (http://bidd2.cse.nus.edu.sg/cfam) was developed to hierarchically cluster drugs, bioactive molecules, human metabolites, natural products, patented agents and other molecules into functional families, superfamilies and classes of structurally similar compounds, based on literature-reported high, intermediate and remote similarity measures, respectively. Compounds were represented by molecular fingerprints, and molecular similarity was measured by the Tanimoto coefficient. The functional seeds of CFam families were selected from hierarchically clustered drugs, bioactive molecules, human metabolites, natural products and patented agents, respectively, and were used to characterize families and to cluster compounds into families, superfamilies and classes. CFam currently contains 11 643 classes, 34 880 superfamilies and 87 136 families of 490 279 compounds (1691 approved drugs, 1228 clinical trial drugs, 12 386 investigative drugs, 262 881 highly active molecules, 15 055 human metabolites, 80 255 ZINC-processed natural products and 116 783 patented agents). Efforts will be made to further expand the CFam database and to add more functional categories and families based on other types of molecular representations.
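
The similarity measure and the seed-directed assignment described above can be illustrated with a minimal Python sketch. It assumes fingerprints are stored as sets of "on" bit indices and computes the Tanimoto coefficient as the number of shared bits divided by the number of bits set in either fingerprint. The cutoff values in `LEVELS`, the function names and the seed dictionary are hypothetical placeholders chosen for illustration; the abstract does not state the thresholds or the exact procedure CFam uses.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# A fingerprint is modelled as the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical cutoffs for high, intermediate and remote similarity.
# These numbers are illustrative assumptions, not values from CFam.
LEVELS = (("family", 0.85), ("superfamily", 0.70), ("class", 0.55))


def assign_to_seed(
    compound: Fingerprint, seeds: Dict[str, Fingerprint]
) -> Optional[Tuple[str, str, float]]:
    """Return (seed_id, level, similarity) for the most similar seed,
    or None if no seed reaches the lowest cutoff."""
    if not seeds:
        return None
    best_id, best_sim = max(
        ((seed_id, tanimoto(compound, fp)) for seed_id, fp in seeds.items()),
        key=lambda pair: pair[1],
    )
    # Report the tightest level whose cutoff the best similarity satisfies.
    for level, cutoff in LEVELS:
        if best_sim >= cutoff:
            return best_id, level, best_sim
    return None
```

In this sketch a compound is compared against every seed and placed at the tightest level its best similarity reaches; a compound that matches no seed at any level is left unassigned.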
