The centroidal algorithm in molecular similarity and diversity calculations on confidential datasets

Summary

A chemical structure gives an exhaustive description of a compound, but it is often proprietary, which impedes the exchange of information. For example, selecting the most similar or most dissimilar compounds usually requires disclosing structures. We propose a centroidal algorithm based on structural fragments (screens) that can be used efficiently for similarity and diversity selections without disclosing the structures in the reference set. For increased security, we recommend that the reference set contain at least some tens of structures. An analysis of the feasibility of reverse engineering showed that the difficulty of the problem grows as the screen radius decreases. The algorithm is illustrated with concrete calculations on known steroidal, quinoline, and quinazoline drugs. We also investigate the problem of scaffold identification in a combinatorial library dataset. The results show that relatively small screens, with a radius of 2 bond lengths, perform well in similarity sorting, while radius-4 screens yield better results in diversity sorting. The software implementation takes an SDF file containing the reference set and generates screens of various radii, which are subsequently used for similarity and diversity sorting of external SDF files. Since reverse engineering the reference-set molecules from their screens is as difficult as breaking the RSA asymmetric encryption algorithm, the generated screens can be stored openly without further encryption. This approach ensures that an end user transfers only a set of structural fragments and no other data. Like other encryption algorithms, the centroidal algorithm cannot give a 100% guarantee of protecting a chemical structure in the dataset, but the probability of identifying an initial structure is very small, on the order of 10⁻⁴⁰ in typical cases.
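The summary does not state the scoring function, so the following is only a minimal sketch of how a screen-based centroid could be used to rank external molecules. It assumes each molecule has already been reduced to a set of screen identifiers (plain strings here), that the centroid stores the fraction of reference molecules containing each screen, and that similarity is a weighted Jaccard (Tanimoto-like) score. The function names `build_centroid`, `centroid_similarity`, and `rank_by_similarity` are hypothetical and are not taken from the authors' software.

```python
from collections import Counter


def build_centroid(reference_screens):
    """Map each screen to the fraction of reference molecules containing it.

    Assumption: the centroid is a per-screen frequency vector; the paper's
    actual centroid definition may differ.
    """
    counts = Counter()
    for screens in reference_screens:
        counts.update(set(screens))  # count each screen once per molecule
    n = len(reference_screens)
    return {screen: count / n for screen, count in counts.items()}


def centroid_similarity(screens, centroid):
    """Weighted Jaccard score between a molecule's screens and the centroid.

    The molecule is treated as a binary vector (weight 1 per screen), so the
    numerator sums the centroid weights of shared screens and the denominator
    sums the element-wise maxima of the two vectors.
    """
    screens = set(screens)
    shared = sum(centroid.get(s, 0.0) for s in screens)
    union = len(screens) + sum(w for s, w in centroid.items() if s not in screens)
    return shared / union if union else 0.0


def rank_by_similarity(external, centroid):
    """Return external molecule names sorted from most to least similar."""
    return sorted(
        external,
        key=lambda name: centroid_similarity(external[name], centroid),
        reverse=True,
    )


# Toy data: the screen identifiers are made-up labels, not real fragments.
reference = [{"a", "b"}, {"a", "c"}]
centroid = build_centroid(reference)
external = {"mol1": {"a", "b"}, "mol2": {"x"}, "mol3": {"a"}}
ranking = rank_by_similarity(external, centroid)
```

Reversing the sort order gives a simple diversity ranking under the same assumptions. Only the centroid dictionary, not the reference structures, is needed to score the external molecules.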
