Paper information - Rigorous Selection of Random Forest Models for Identifying Compounds that Activate Toxicity-Related Pathways

Random forest (RF) is a machine-learning ensemble method with high predictive performance. An RF classifies by majority vote over the outputs of numerous decision trees, each built from a bootstrap sample of the data. Because bootstrapping is random, RF models generated from the same dataset differ in predictive performance from one run to the next. As participants in the Toxicology in the 21st Century (Tox21) DATA Challenge 2014, we produced numerous RF models for identifying compounds that can activate each toxicity-related pathway, and then selected the model with the highest predictive ability. Half of the compounds in the training dataset supplied by the competition organizer were allocated to a validation dataset. The remaining compounds were used for model construction. The charged and uncharged forms of each molecule were generated using the Molecular Operating Environment (MOE) software. Descriptors were then computed using MOE, MarvinView, and Dragon. Together, these tools yielded over 4,071 descriptors for model construction. Using these descriptors, pattern recognition analyses were performed with the RF implementation in JMP Pro (a statistical software package). One hundred to two hundred RF models were generated for each pathway. The predictive performance of each model was tested against the validation dataset, and the best-performing model was selected. In the competition, the model selected in this way against the 50% validation set gave the best prediction of compounds that activate the estrogen receptor ligand-binding domain (ER-LBD).
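The selection procedure described in the abstract (hold out half of the training data, build many RF models that differ only in their random bootstrap draws, keep the one that scores best on the held-out half) can be sketched roughly as follows. This is a minimal illustration, not the authors' JMP Pro workflow: the data are synthetic, each "tree" is a one-split stump on a randomly chosen feature, accuracy stands in for whatever performance measure was actually used, and every function name here is hypothetical.

```python
import random

def make_data(n, rng):
    """Synthetic stand-in for descriptor data (assumption): 5 features, label driven by the first two."""
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [1 if row[0] + 0.5 * row[1] + rng.gauss(0, 0.5) > 0 else 0 for row in X]
    return X, y

def fit_stump(X, y, rng):
    """One-split 'tree': random feature, best of 10 candidate thresholds by training hits."""
    f = rng.randrange(len(X[0]))
    best = None
    for t in rng.sample([row[f] for row in X], 10):
        for sign in (1, -1):
            hits = sum((sign * (row[f] - t) > 0) == bool(label) for row, label in zip(X, y))
            if best is None or hits > best[0]:
                best = (hits, f, t, sign)
    return best[1:]  # (feature index, threshold, sign)

def predict_stump(stump, row):
    f, t, sign = stump
    return 1 if sign * (row[f] - t) > 0 else 0

def fit_forest(X, y, seed, n_trees=25):
    """Build one forest; the seed controls the bootstrap draws, so each seed gives a different model."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample with replacement
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees (n_trees is odd, so there are no ties)."""
    votes = sum(predict_stump(s, row) for s in forest)
    return 1 if 2 * votes > len(forest) else 0

def accuracy(forest, X, y):
    return sum(predict_forest(forest, r) == label for r, label in zip(X, y)) / len(y)

def select_best_forest(X_train, y_train, X_val, y_val, seeds):
    """Train one forest per seed and return the seed and score of the best one on the validation set."""
    scored = [(accuracy(fit_forest(X_train, y_train, s), X_val, y_val), s) for s in seeds]
    best_score, best_seed = max(scored)
    return best_seed, best_score, [score for score, _ in scored]

rng = random.Random(0)
X, y = make_data(200, rng)

# 50/50 split: half for model construction, half held out for validation.
order = list(range(len(X)))
rng.shuffle(order)
half = len(order) // 2
X_train, y_train = [X[i] for i in order[:half]], [y[i] for i in order[:half]]
X_val, y_val = [X[i] for i in order[half:]], [y[i] for i in order[half:]]

# 100 candidate models, matching the lower end of the 100 to 200 models per pathway.
best_seed, best_score, scores = select_best_forest(X_train, y_train, X_val, y_val, range(100))
print(f"validation accuracy range: {min(scores):.2f} to {max(scores):.2f}; best seed: {best_seed}")
```

The spread between the lowest and highest validation scores is the run-to-run variation that the selection step exploits. Because the winning model is chosen on the validation half, its score there tends to be optimistic, so a separate evaluation (the competition test set, in the authors' case) is still needed.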

Yoshihiro Uesawa
