Chemical Data Mining of the NCI Human Tumor Cell Line Database

The NCI Developmental Therapeutics Program Human Tumor Cell Line data set is a publicly available database that contains cellular assay screening data for over 40 000 compounds tested in 60 human tumor cell lines. The database also contains microarray gene expression data for the cell lines, which makes it an excellent resource for testing data mining methods that bridge chemical, biological, and genomic information. In this paper we describe a formal knowledge discovery approach to characterizing and mining this data set, and we report the results of some of our initial experiments in mining it from a chemoinformatics perspective.
