Learning to Predict Carcinogenesis of Unstudied Chemicals in Rodents from Completed Rodent Trials

The National Toxicology Program (NTP) studies chemicals to determine whether they are carcinogenic. These experiments include subchronic (90-day) and chronic (2-year) rodent exposure studies and are therefore costly and time-consuming. The long-range goal of our research is to learn Bayesian belief networks from the NTP data and use them to predict the classification of chemicals at various milestones during the testing process, when that information could be used to justify either continuing or terminating the experiments. NTP has data on 226 chemicals that have already been classified. The data contain almost 1000 attributes, including the results of microbial assays, physicochemical parameters, and the results of 90-day rodent exposure studies. While the data set contains continuous, discrete, and binary attributes, the majority of the attributes (836) are subchronic exposure study results representing the presence or absence of organ pathology. Detection of any particular combination of organ and morphology (damage) is rare, so these attributes are very sparsely positive, which makes it difficult to detect whether an attribute is significant. In addition, not all of the exposure studies are done for every chemical, so a number of chemicals are missing values for many of these attributes. Our approach to handling this complex data set has been to apply various feature selection techniques and statistical analyses (such as linear discriminant analysis) to the exposure study result attributes. The models we build use the results of that analysis in combination with the other attributes to make predictions for various subsets of the chemical population. The cross-validated accuracies of these models have ranged from 70% to 92%. We are also building models from the data set directly. In analyzing the results of our preliminary models, it became apparent that much of the difficulty with this data set comes from the level of noise in the result attributes of the exposure studies.
Our endpoint combines many different types of carcinogenicity, each of which is likely to show different types of organ damage, so our model is called upon to find many biological pathways to many endpoints. We believe that this broad endpoint aggravates the noise in the data. In future experiments, we plan to replace it with endpoints for cancers in specific organs, with the goal of increasing the accuracy and intuitiveness of our models.
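To make concrete why sparsely positive attributes are hard to assess for significance, the following minimal sketch computes a one-sided Fisher exact test for a single binary pathology attribute. It is an illustration only and is not part of the analysis reported here. The even split of the 226 chemicals into 113 carcinogens and 113 non-carcinogens is a hypothetical assumption, as are the function name `fisher_one_sided` and the choice of Fisher's exact test itself.

```python
from math import comb


def fisher_one_sided(pos_in_class, pos_total, class_size, population):
    """One-sided Fisher exact p-value (hypergeometric upper tail).

    Probability that at least `pos_in_class` of the `pos_total` positive
    chemicals fall in a class of `class_size` chemicals, out of
    `population` chemicals, if the attribute is unrelated to the class.
    """
    denom = comb(population, pos_total)
    upper = min(pos_total, class_size)
    tail = sum(
        comb(class_size, k) * comb(population - class_size, pos_total - k)
        for k in range(pos_in_class, upper + 1)
    )
    return tail / denom


# Hypothetical class split: the source gives only the total of 226 chemicals.
POPULATION = 226
CARCINOGENS = 113  # assumed for illustration

# Best case for a sparse attribute: every positive chemical is a carcinogen.
for n_positive in range(1, 7):
    p = fisher_one_sided(n_positive, n_positive, CARCINOGENS, POPULATION)
    print(f"{n_positive} positives, all carcinogens: p = {p:.4f}")
```

Under these assumed numbers, an attribute that is positive for only three chemicals gives p of about 0.12 even when all three are carcinogens, and at least five such positives are needed before p falls below 0.05. Correcting for the several hundred pathology attributes tested would raise that threshold further.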