Estimating Error Rates in Bioactivity Databases

Bioactivity databases are routinely used in drug discovery to look up and, with the help of prediction tools, to predict potential targets for small molecules. These databases are typically manually curated from patents and scientific articles. In addition to errors already present in the source documents, human error can be introduced during the extraction process. Such errors can lead to wrong decisions in early drug discovery. In the current work, we have compared bioactivity data from three large databases (ChEMBL, Liceptor, and WOMBAT) that have curated data from the same source documents. As a result, we are able to report error rate estimates for individual activity parameters and for individual bioactivity databases. Small molecule structures have the greatest estimated error rate, followed by target, activity value, and activity type. This order is also reflected in the supplier-specific error rate estimates. The results are also useful for identifying data points for recuration. We hope the results will lead to a more widespread awareness among scientists of the frequencies and types of errors in bioactivity data.
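The abstract does not spell out how discrepancies between the three databases are converted into error rates. As a rough illustration only, the sketch below assumes a simple majority-vote rule: when two databases agree on a field and the third differs, the odd one out is counted as the likely error. The function names, the record layout, and the example values are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def odd_one_out(values):
    """Return the database whose value disagrees with the other two, or None.

    `values` maps a database name to the value it reports for one field of
    the same data point (hypothetical layout, assumed for illustration).
    """
    counts = Counter(values.values())
    if len(values) != 3 or len(counts) != 2:
        return None  # full agreement or three-way disagreement: no majority
    minority_value = min(counts, key=counts.get)
    return next(db for db, v in values.items() if v == minority_value)

def error_rates(records):
    """Fraction of records in which each database is the odd one out."""
    flagged = Counter()
    for values in records:
        db = odd_one_out(values)
        if db is not None:
            flagged[db] += 1
    databases = {db for values in records for db in values}
    return {db: flagged[db] / len(records) for db in sorted(databases)}

# Made-up activity-type values for four data points; not real database content.
records = [
    {"ChEMBL": "Ki", "Liceptor": "Ki", "WOMBAT": "Ki"},
    {"ChEMBL": "Ki", "Liceptor": "IC50", "WOMBAT": "Ki"},
    {"ChEMBL": "IC50", "Liceptor": "IC50", "WOMBAT": "IC50"},
    {"ChEMBL": "Ki", "Liceptor": "Ki", "WOMBAT": "IC50"},
]
rates = error_rates(records)
```

Under this assumed rule, a record on which all three databases disagree yields no estimate, because there is no majority to treat as the reference. The same comparison could be run separately per field (structure, target, activity value, activity type) to obtain parameter-specific estimates.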
