Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge

Background

Combining different sources of knowledge to build improved structure-activity relationship (SAR) models is difficult, owing to the variety of knowledge formats and the absence of a common framework for interoperating between learning techniques. Most current approaches address this problem with consensus models that operate at the prediction level. We explore the possibility of combining these sources directly at the knowledge level, with the aim of exploiting potential synergy at an earlier stage. Our goal is to design a general methodology that facilitates knowledge discovery and produces accurate and interpretable models.

Results

To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of a hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into a hierarchical network. This unification makes it possible to combine different sources of knowledge in a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge, and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems, along with an illustrative application to the prediction of mutagenicity.

Conclusion

It is possible to represent knowledge in the unified form of a hypothesis network, allowing interpretable predictions with performance comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources in a common framework in which high-level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.
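The idea of breaking knowledge down into hypotheses and organising them into a hierarchy can be pictured with a small sketch. Everything below is an illustrative assumption rather than the method of the article: a hypothesis is modelled as a set of abstract features with a binary label, the hierarchy links each hypothesis to its closest generalisations (strict feature subsets), and prediction is a simple majority vote among the most specific matching hypotheses. The names `Hypothesis`, `build_network` and `predict`, and the features `a`, `b`, `c`, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical knowledge unit: a set of required features and a binary label."""
    name: str
    features: frozenset
    label: str  # "active" or "inactive"


def build_network(hypotheses):
    """Map each hypothesis name to the names of its direct parents.

    Assumption: hypothesis A generalises B when A's features are a strict
    subset of B's. A direct parent is a generalisation with no other
    generalisation lying strictly between it and the child.
    """
    parents = {h.name: [] for h in hypotheses}
    for child in hypotheses:
        ancestors = [h for h in hypotheses if h.features < child.features]
        for a in ancestors:
            if not any(a.features < b.features for b in ancestors):
                parents[child.name].append(a.name)
    return parents


def predict(hypotheses, compound_features):
    """Majority vote of the most specific hypotheses matching the compound.

    Returns None when nothing matches or when the vote is tied.
    """
    matching = [h for h in hypotheses if h.features <= compound_features]
    most_specific = [
        h for h in matching
        if not any(h.features < other.features for other in matching)
    ]
    if not most_specific:
        return None
    counts = Counter(h.label for h in most_specific).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Toy hypotheses with abstract features (not real SAR knowledge).
HYPOTHESES = [
    Hypothesis("h1", frozenset({"a"}), "active"),
    Hypothesis("h2", frozenset({"a", "b"}), "inactive"),
    Hypothesis("h3", frozenset({"c"}), "active"),
    Hypothesis("h4", frozenset({"a", "b", "c"}), "active"),
]

NETWORK = build_network(HYPOTHESES)
```

In this toy setting, a compound with features `a` and `b` matches both `h1` and `h2`, and the more specific `h2` decides the outcome. Because hypotheses from any source share the same representation, knowledge learned by different techniques could in principle be placed in the same hierarchy, which is the kind of unification the abstract describes.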
