Pattern recognition approach to classifying the CYP2C19 isoform

This paper presents a pattern recognition approach to quantitative structure-property relationship (QSPR) classification for the CYP2C19 isoform. QSPR is the correlative computer modelling of the properties of chemical molecules and is widely used in cheminformatics and the pharmaceutical industry. Predicting whether a particular chemical will be metabolized by CYP2C19 is of primary importance to the pharmaceutical industry. The task poses several challenges. First, the analyzed data are characterized by significant biological noise. Second, the training set is unbalanced, with objects from the negative class outnumbering the positives by a factor of four. The presented solution addresses both problems and additionally incorporates thorough feature selection to improve the stability of the obtained results. Strong emphasis is placed on outlier detection and proper model validation to achieve the best predictive power.
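
To make the class-imbalance problem concrete, the following is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how the imbalance is handled. It uses toy labels with the four-to-one negative-to-positive ratio described above, and the helper names `balanced_class_weights` and `balanced_accuracy` are hypothetical. It shows why plain accuracy is misleading on such data and how inverse-frequency class weights are commonly derived.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of objects whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; does not reward favoring the majority class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy data (assumed, not from the paper): 80 negatives, 20 positives.
y_true = [0] * 80 + [1] * 20

# A trivial classifier that labels every compound as negative.
y_all_negative = [0] * len(y_true)

weights = balanced_class_weights(y_true)
plain = accuracy(y_true, y_all_negative)
balanced = balanced_accuracy(y_true, y_all_negative)

print(weights)   # per-class weights, larger for the minority class
print(plain)     # high despite the classifier finding no positives
print(balanced)  # exposes that the positive class is never recovered
```

With this ratio, a classifier that never predicts the positive class still scores well on plain accuracy, which is one reason validation on unbalanced data usually relies on per-class measures. The weights could be passed to any learner that accepts per-class misclassification costs.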
