Combination of Similarity Rankings Using Data Fusion

■ INTRODUCTION

Similarity searching is one of the most common techniques for ligand-based virtual screening and involves scanning a chemical database to identify those molecules that are most similar to a user-defined reference structure using some quantitative measure of intermolecular structural similarity. The similar property principle states that molecules that are structurally similar are also likely to exhibit similar properties, and hence ranking a database in order of decreasing similarity with a bioactive reference structure is expected to highlight database structures that have high a priori probabilities of exhibiting the same activity.

Any similarity measure has three principal components: the representation that is used to describe each of the structures that are to be considered; the weighting scheme that is used to assign weights to different parts of the structure representation that reflect their relative degrees of importance; and the similarity coefficient that is used to quantify the degree of resemblance between two suitably weighted representations. Multiple approaches have been described for each of these three components, resulting in a potentially vast number of similarity measures that could be used for virtual screening. This multiplicity has resulted in many studies that seek to determine which measures are most effective using some quantitative measure of screening performance. While certain types of measures are known to provide a reasonable level of retrieval effectiveness (e.g., those making use of Pipeline Pilot ECFP_4 fingerprints, of occurrence-based frequency weighting, or of the Tanimoto similarity coefficient), there is a general recognition that there is no single similarity measure that will provide optimal screening in all circumstances.
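As a minimal illustration of the three components of a similarity measure, the sketch below assumes the simplest possible choices: a binary fingerprint held as a Python set of integer feature identifiers (the representation), no weighting (every feature counts equally), and the Tanimoto coefficient. The function names `tanimoto` and `rank_database` and the toy fingerprints are hypothetical and are not taken from any particular toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: set of "on" feature ids


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Unweighted Tanimoto coefficient: |A and B| / |A or B|."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def rank_database(
    reference: Fingerprint, database: Dict[str, Fingerprint]
) -> List[Tuple[str, float]]:
    """Rank database molecules in decreasing similarity to the reference."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    # Sort by decreasing score; break ties by name so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data (invented for illustration only).
reference = frozenset({1, 2, 3, 4})
database = {
    "mol_a": frozenset({1, 2, 3, 4, 5}),
    "mol_b": frozenset({1, 2}),
    "mol_c": frozenset({7, 8}),
}
ranking = rank_database(reference, database)
```

With these toy fingerprints, `mol_a` shares four of the five features in the union and scores 0.8, `mol_b` scores 0.5, and `mol_c` scores 0.0. Changing any one of the three components (for example, counting feature occurrences rather than presence) would in general give a different ranking.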
The absence of a single best similarity measure has been well summarized by Sheridan and Kearsley, who note that “we have come to regard looking for ‘the best’ way of searching chemical databases as a futile exercise. In both retrospective and prospective studies, different methods select different subsets of actives for the same biological activity and the same method might work better on some activities than others”. Given that no single similarity searching method is consistently effective for ranking a database in decreasing similarity order, it has been suggested that multiple searches should be carried out. The results of these individual searches are then combined (or merged or fused) into a single ranking that is the final output presented to the user for subsequent compound selection and biological testing. Such combination approaches have been used not only in similarity searching and other types of ligand-based virtual screening, where they are normally referred to as data fusion, but also in structure-based virtual screening, where they are normally referred to as consensus scoring. There is hence much interest in combining the ligand-based and structure-based approaches to virtual screening.

This perspective discusses the use of data fusion in similarity-based virtual screening; other similarity-related applications of data fusion include, inter alia, the analysis of molecular diversity and of structure−activity landscapes. A previous review provided an overview of data fusion methods in ligand-based virtual screening up to 2005. However, the technique has now been so widely adopted that it is difficult to provide a comprehensive review: a Google Scholar search in late 2012 for “Data Fusion” AND “Virtual Screening” identified over 350 post-2005 items.
Accordingly, after the next section has described the basic approach and the various ways in which it can be implemented, this perspective focuses on two specific aspects of data fusion: the fusion rules that have been described in the literature for combining rankings, and work at Sheffield that seeks to provide a rationale for why data fusion methods work in practice. The focus here is the combination of similarity rankings, but many of the methods described are equally applicable to the combination of rankings that result from other approaches to screening chemical databases, e.g., machine learning techniques.
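To make the idea of a fusion rule concrete, the following sketch combines several rankings of the same database into one. It assumes two simple rank-based rules, the sum of a molecule's ranks and its best (minimum) rank across the individual searches, purely as examples; it is not presented as the specific set of rules discussed later in this perspective. The names `to_ranks` and `fuse`, and the penalty rank given to molecules missing from a list, are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def to_ranks(ranking: Sequence[str]) -> Dict[str, int]:
    """Map each molecule to its 1-based position in one ranked list."""
    return {mol: pos for pos, mol in enumerate(ranking, start=1)}


def fuse(rankings: Sequence[Sequence[str]], rule: str = "sum") -> List[str]:
    """Fuse ranked lists into a single ranking (best molecule first).

    rule="sum": order by the sum of ranks over all lists.
    rule="min": order by the best rank achieved in any list.
    """
    if rule not in ("sum", "min"):
        raise ValueError("rule must be 'sum' or 'min'")
    rank_maps = [to_ranks(r) for r in rankings]
    molecules = set().union(*rank_maps)
    # Assumption: a molecule absent from a list is ranked one place below
    # the end of the longest list.
    worst = max(len(r) for r in rankings) + 1
    combine = sum if rule == "sum" else min
    fused = {m: combine(rm.get(m, worst) for rm in rank_maps) for m in molecules}
    # Lower fused value is better; ties are broken by name for determinism.
    return sorted(fused, key=lambda m: (fused[m], m))


# Three toy rankings, e.g. from three different similarity measures.
search_1 = ["a", "b", "c", "d"]
search_2 = ["b", "a", "d", "c"]
search_3 = ["b", "c", "a", "d"]

fused_sum = fuse([search_1, search_2, search_3], rule="sum")
fused_min = fuse([search_1, search_2, search_3], rule="min")
```

In this toy case the rank sums are 4 for `b`, 6 for `a`, 9 for `c`, and 11 for `d`, so the sum rule places `b` first, whereas the minimum rule ties `a` and `b` at rank 1 and separates them only by the name-based tie-break. Even on three short lists, the choice of fusion rule can therefore change the final ordering.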
