Increasing the performance, trustworthiness and practical value of machine learning models: a case study predicting hydrogen bond network dimensionalities from molecular diagrams

The performance of a model depends on the quality and information content of the data used to build it. By applying machine learning approaches to a standard chemical dataset, we developed a 4-class classification algorithm that predicts the hydrogen bond network dimensionality a molecule would adopt in its crystal form with an accuracy of 59% (compared with a 25% random baseline), using exclusively two- and lower-dimensional molecular descriptors. Although better than random, this level of performance did not meet the standard required for reliable application. We improved the practical value of the model by wrapping it in a confidence tool that increases model robustness, quantifies the trust that can be placed in each prediction, and allows the classifier to be operated at virtually any accuracy level. Using this tool, the accuracy of the model could be raised to 73% or 89%, at the cost of predicting only 34% or 8% of the test examples, respectively. We anticipate that the ability to adjust the performance of reliable 2D-based models to the requirements of their different applications may increase their practical value, making them suitable for tasks that range from initial virtual library filtering to profile-specific compound identification.
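
The trade-off between accuracy and the fraction of examples predicted can be illustrated with a minimal sketch. It assumes, purely for illustration, that the confidence of a prediction is the largest predicted class probability and that predictions below a chosen threshold are withheld; the abstract does not specify how the confidence tool is built, so this is not the authors' method. The function name `selective_accuracy` and the toy probabilities and labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def selective_accuracy(
    probabilities: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    threshold: float,
) -> Tuple[Optional[float], float]:
    """Return (accuracy on retained predictions, coverage) for one threshold.

    Assumption: confidence = largest class probability. Examples whose
    confidence falls below `threshold` are not predicted (the model abstains).
    """
    if len(probabilities) != len(true_labels):
        raise ValueError("probabilities and true_labels must have equal length")

    retained = 0
    correct = 0
    for probs, label in zip(probabilities, true_labels):
        confidence = max(probs)
        if confidence < threshold:
            continue  # abstain on low-confidence examples
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        retained += 1
        correct += int(predicted == label)

    coverage = retained / len(true_labels) if true_labels else 0.0
    accuracy = correct / retained if retained else None  # undefined if nothing retained
    return accuracy, coverage


# Hypothetical 4-class probabilities and labels, invented for illustration only.
toy_probs = [
    [0.90, 0.05, 0.03, 0.02],  # confident and correct
    [0.40, 0.30, 0.20, 0.10],  # unsure and wrong
    [0.10, 0.70, 0.10, 0.10],  # confident and correct
    [0.30, 0.30, 0.35, 0.05],  # unsure and wrong
    [0.05, 0.05, 0.10, 0.80],  # confident and correct
]
toy_labels = [0, 1, 1, 3, 3]

for t in (0.0, 0.6, 0.95):
    acc, cov = selective_accuracy(toy_probs, toy_labels, t)
    print(f"threshold={t:.2f}  accuracy={acc}  coverage={cov:.2f}")
```

On this toy data, raising the threshold from 0.0 to 0.6 discards the two uncertain examples, so accuracy on the retained predictions rises while coverage falls. This is the same kind of compromise the abstract reports between 73% or 89% accuracy and 34% or 8% of test examples predicted.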
