Outlier Mining in High Throughput Screening Experiments

A data mining procedure for the rapid scoring of high-throughput screening (HTS) compounds is presented. The method is particularly useful for monitoring the quality of HTS data and tracking outliers in automated pharmaceutical or agrochemical screening, and thereby yields more complete structure-activity relationship (SAR) information. It exploits the assumed relationship between the structure of the screened compounds and their biological activity in a given screen, expressed on a binary (hit/nonhit) scale. A data mining method is used to develop a SAR description of the data that assigns to each compound in the screen a probability of being a hit. An inconsistency score is then computed that expresses how far the measured biological activity of a compound deviates from what the SAR description predicts. The inconsistency score enables the identification of potential outliers, which can be prioritized for validation experiments. The approach is particularly useful for detecting false-negative outliers and for identifying SAR-compliant compounds on the hit/nonhit borderline; both classes of compounds can contribute substantially to the development and understanding of robust SARs. In a first implementation of the method, one- and two-dimensional descriptors are used to encode molecular structure information, and logistic regression is used to calculate hit/nonhit probability scores. The approach was validated on three data sets: one publicly available screening data set and two from in-house HTS campaigns. Because of its simplicity, robustness, and accuracy, the procedure is suitable for automation.
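The scoring idea can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the abstract: the inconsistency score is taken to be the absolute difference between the observed binary activity and the predicted hit probability, the two descriptor columns and all values are made-up toy numbers, and a plain gradient-descent fit stands in for whatever logistic regression routine the original implementation used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=1.0, epochs=5000):
    """Fit intercept and weights by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def hit_probabilities(X, w, b):
    """Predicted probability of being a hit for each compound."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def inconsistency_scores(y, p):
    """Assumed score: |observed activity - predicted hit probability|."""
    return [abs(yi - pi) for yi, pi in zip(y, p)]


# Hypothetical descriptor matrix (two toy descriptors per compound).
X = [
    [0.10, 0.20], [0.20, 0.10], [0.30, 0.30], [0.15, 0.15],  # recorded nonhits
    [0.80, 0.90], [0.90, 0.80], [0.85, 0.85],                # recorded hits
    [0.90, 0.90],  # resembles the hits but recorded as a nonhit
]
y = [0, 0, 0, 0, 1, 1, 1, 0]  # binary screening outcome (1 = hit)

w, b = fit_logistic(X, y)
probs = hit_probabilities(X, w, b)
scores = inconsistency_scores(y, probs)

# Rank compounds from most to least inconsistent with the fitted SAR model.
ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
for i in ranked[:3]:
    print(f"compound {i}: observed={y[i]} p_hit={probs[i]:.2f} score={scores[i]:.2f}")
```

In this toy data the last compound sits among the recorded hits in descriptor space but is labeled a nonhit, so it should receive the largest score and would be the first candidate for a false-negative recheck. Compounds with predicted probabilities near 0.5 would correspond to the hit/nonhit borderline cases mentioned above.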
