High-throughput screening is an increasingly popular approach to finding biologically active compounds. However, screening hit sets inevitably contain large numbers of compounds that have little likelihood of being developed into drugs. To avoid wasting additional resources on such leads, Baell and Holloway recently published a list of structural features that help identify problematic structures that generate frequent false positives across screening campaigns (pan-assay interference compounds, PAINS). The work has attracted considerable attention from the industrial and academic communities.

The filter list was published in Sybyl Line Notation (SLN), a format usable only by the proprietary Sybyl software package. Guha recently published on his web site a conversion of these SLN filters to the SMARTS format, produced with the CACTVS toolkit (Xemistry GmbH), so that the filters could be used in a broader range of software packages. Because this was an automated conversion, concern has been expressed that the SMARTS filters will not deliver exactly the same structural matches as the original SLN filters, so that structures known to fail the PAINS filters would remain in a screening set. Beyond any mismatch between the SLN and SMARTS definitions of a pattern, different cheminformatics toolkits can also make different assumptions about molecular structures. For example, toolkits may use different aromaticity models, which affects which molecules match an aromatic query. As a result, SMARTS matching with different toolkits can give results that differ both from one another and from those obtained with the original SLN filters. For example, Lagorce et al. found that the SMARTS filters had to be adjusted manually for use with the OpenBabel library in order to reproduce the matches obtained with the SLN filters. In addition, pre-processing options applied when the target structures are imported, such as aromatization, desalting, or protonation, may also cause differing results.

Hence, we wished to use an open platform to test the SMARTS filters and to allow the chemical community to benchmark different chemistry software packages in an intuitive manner. To this end, we chose the open-source and freely available Konstanz Information Miner (KNIME, http://knime.org). KNIME is a data analysis platform built around a graphical workflow, or ‘pipeline’, interface and includes several chemistry-related nodes. Workflows can be exported and distributed freely to other users, and they run on the three platforms currently supported by KNIME (Linux, MacOS, Windows). KNIME is distributed with the Chemistry Development Kit (CDK) and, more recently, the RDKit and Indigo software packages. It can access additional chemistry software through its ‘external tool’ node, and a number of vendors provide nodes for their proprietary packages. A disadvantage of KNIME, however, is that only basic settings and functions may be accessible for the included chemistry packages, because advanced functionality has been sacrificed for ease of use. Further, the distributed packages are often not the most recent versions and may contain bugs that have already been corrected in later releases. These shortcomings can be overcome by calling recent versions of the packages through the built-in ‘external tool’ node, but this adds an extra level of complexity for the user.
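
The comparison underlying such a benchmark, in which the compounds flagged by each toolkit are checked against the matches from the original SLN filters, can be sketched in a few lines of Python. This is only a minimal illustration and not the workflow itself: the function name `compare_to_reference`, the toolkit labels, and the compound identifiers are hypothetical, and the hit sets are invented purely to show the bookkeeping.

```python
from typing import Dict, Set


def compare_to_reference(
    reference_hits: Set[str],
    toolkit_hits: Dict[str, Set[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Compare each toolkit's flagged compounds with a reference hit set
    (for example, the matches obtained with the original SLN filters)."""
    report = {}
    for name, hits in toolkit_hits.items():
        report[name] = {
            # Flagged by the reference but not by this toolkit.
            "missed": reference_hits - hits,
            # Flagged by this toolkit but not by the reference.
            "extra": hits - reference_hits,
        }
    return report


# Hypothetical compound identifiers, invented for illustration only.
reference = {"cmpd_001", "cmpd_002", "cmpd_003"}
results = {
    "toolkit_A": {"cmpd_001", "cmpd_002", "cmpd_003"},
    "toolkit_B": {"cmpd_001", "cmpd_003", "cmpd_004"},
}

report = compare_to_reference(reference, results)
for name in sorted(report):
    diff = report[name]
    print(name, "missed:", sorted(diff["missed"]), "extra:", sorted(diff["extra"]))
```

In this sketch a "missed" compound is one that would stay in the screening set although the reference filters reject it, which is the failure mode of most concern; an "extra" compound is one that would be removed unnecessarily.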