Predictive cheminformatics in drug discovery: statistical modeling for analysis of micro-array and gene expression data.

The vast amounts of chemical and biological data available through robotic high-throughput assays and micro-array technologies require computational techniques for visualization, analysis, and predictive modeling. Predictive cheminformatics and bioinformatics employ statistical methods to mine these data for hidden correlations and to retrieve molecules or genes with desirable biological activity from large databases for the purpose of drug development. While many statistical methods are commonly employed and widely accessible, their proper use requires due consideration of data representation and preprocessing, model validation and estimation of the domain of applicability, similarity assessment, the nature of the structure-activity landscape, and model interpretation. This chapter reviews these considerations in light of the current state of the art in statistical modeling and summarizes best practices in predictive cheminformatics.
