A Comparative Study of Machine Learning Algorithms Applied to Predictive Toxicology Data Mining

This paper reports the results of a comparative study of widely used machine learning algorithms applied to predictive toxicology data mining. The algorithms were chosen for their representativeness and diversity, and were extensively evaluated on seven toxicity data sets taken from real-world applications. We emphasise results from a visual analysis of how different descriptors correlate with the class values of chemical compounds, and of how the range of chosen descriptors relates to the performance of the machine learning algorithms. We also present some findings relating to the data and the quality of the models: for example, no single algorithm appears best for all seven toxicity data sets, and up to five descriptors are sufficient to build classification models with good accuracy for each toxicity data set. We suggest that, for a specific data set, model accuracy is affected by both the feature selection method and the model development technique. Models built with too many or too few descriptors are undesirable, and finding the optimal feature subset appears at least as important as selecting appropriate algorithms with which to build the final model.
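The interplay between descriptor subset and model accuracy can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the procedure used in the study: the choice of a k-nearest-neighbour classifier, greedy forward selection, leave-one-out evaluation, the cap of five descriptors taken from the figure quoted above, and the tiny made-up data set (descriptor 0 separates the two classes, the others do not). The names `knn_loo_accuracy` and `forward_select` are hypothetical.

```python
from collections import Counter


def knn_loo_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of a k-NN classifier that only sees `features`."""
    correct = 0
    for i, row in enumerate(X):
        # Squared Euclidean distance to every other compound,
        # computed over the chosen descriptors only.
        dists = sorted(
            (sum((row[f] - other[f]) ** 2 for f in features), y[j])
            for j, other in enumerate(X)
            if j != i
        )
        # Majority vote among the k nearest neighbours.
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def forward_select(X, y, max_features=5, k=3):
    """Greedily add the descriptor that most improves accuracy; stop when none does."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(knn_loo_accuracy(X, y, selected + [f], k), f) for f in remaining]
        # Highest accuracy wins; ties go to the lowest descriptor index.
        acc, f = max(scored, key=lambda t: (t[0], -t[1]))
        if acc <= best_acc:
            break  # an extra descriptor no longer helps
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc


# Made-up toy data: 12 "compounds", 4 descriptors, binary toxicity class.
d0 = [0.10, 0.20, 0.15, 0.25, 0.05, 0.30, 0.80, 0.90, 0.85, 0.75, 0.95, 0.70]
X = [[d0[i], float(i % 2), 0.5, float(i % 3)] for i in range(12)]
y = [0] * 6 + [1] * 6

selected, accuracy = forward_select(X, y)
print("selected descriptors:", selected, "leave-one-out accuracy:", accuracy)
```

On real toxicity data the same loop could be repeated with several classifiers in place of k-NN, which would show how strongly the ranking of algorithms depends on the descriptor subset each one is given.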
