ChemModLab: A Web-Based Cheminformatics Modeling Laboratory

ChemModLab, written by the ECCR @ NCSU consortium under NIH support, is a toolbox for fitting and assessing quantitative structure-activity relationships (QSARs). It has three elements: a cheminformatic front end that supplies molecular descriptors for modeling, a set of methods for fitting models, and methods for validating the resulting models. Compounds may be input as structures, from which standard descriptors are calculated using the freely available cheminformatic front end PowerMV; PowerMV also supports compound visualization. Alternatively, users can supply descriptors of their own choosing, so the scope for comparing descriptor sets is effectively unlimited. The statistical methodologies form a comprehensive collection of approaches whose validity and utility have been accepted by experts in the field. As far as possible, these tools are implemented in open-source software linked into the flexible R platform, so that the user can apply many different QSAR modeling methods in a seamless way. As promising new QSAR methodologies emerge from the statistical and data-mining communities, they will be incorporated into the laboratory. The web site also links to public-domain data sets that can be used as test cases for proposed new modeling methods. The capabilities of ChemModLab are illustrated using a variety of biological responses, with different modeling methodologies applied to each. These illustrations show clear differences among methods, both in the quality of the fitted QSAR model and in computational requirements. The laboratory is web-based, and its use is free. Researchers with new assay data, a new descriptor set, or a new modeling method may readily build QSAR models and benchmark their results against other findings. Users may also examine the diversity of the molecules identified by a QSAR model. Moreover, users may either place their data sets in a public area to facilitate communication with other researchers or keep them hidden to preserve confidentiality.
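The workflow described above (descriptors in, several modeling methods fitted, each assessed by validation) can be sketched in a few lines. The following is only an illustrative sketch and not ChemModLab's implementation, which is built on R: the synthetic descriptor matrix, the two toy models (a training-mean baseline and a k-nearest-neighbor predictor), the use of k-fold cross-validation as the validation scheme, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean


def knn_predict(train_X, train_y, x, k=3):
    """Predict activity as the mean response of the k nearest training compounds."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    return mean(y for _, y in dists[:k])


def mean_predict(train_X, train_y, x):
    """Baseline: ignore the descriptors and predict the training-set mean."""
    return mean(train_y)


def cross_validated_mse(X, y, predict, n_folds=5, seed=0):
    """Estimate the mean squared error of `predict` by k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            errors.append((predict(train_X, train_y, X[i]) - y[i]) ** 2)
    return mean(errors)


# Hypothetical data: 60 "compounds", 2 descriptors each, and a noisy
# linear response standing in for a measured biological activity.
rng = random.Random(42)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(60)]
y = [3.0 * a - 2.0 * b + rng.gauss(0, 0.1) for a, b in X]

# Benchmark every method on the same folds and rank by validated error.
results = {
    "baseline_mean": cross_validated_mse(X, y, mean_predict),
    "knn_k3": cross_validated_mse(X, y, knn_predict),
}
for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```

Because every method is scored on the same held-out folds, the resulting errors are directly comparable, which is the kind of side-by-side benchmarking of descriptors and methods that the abstract describes.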
