Unraveling the origin of splice switching activity of hemoglobin β-globin gene modulators via QSAR modeling

Abstract: β-Thalassemia is a blood disease caused by a mutation in the second intron of the β-globin gene of hemoglobin that leads to abnormal hemoglobin production. Low molecular weight compounds have been proposed to modulate defective splicing by binding to unwanted splice sites, thereby restoring correct splicing. This study investigates the origin of this splice switching activity in a set of 39 active and 61,000 inactive compounds. The K-means algorithm was applied to the inactive compounds with 39 clusters, and one point from each cluster was selected to create a balanced data set of 39 active and 39 inactive compounds. To avoid random bias, predictive models (i.e., decision tree (DT), random forest (RF), artificial neural network (ANN), partial least squares discriminant analysis (PLS-DA) and support vector machine (SVM)) were constructed 50 times. The performance of the predictive models was statistically assessed in terms of accuracy, sensitivity, specificity and Matthews correlation coefficient (MCC). RF provided an accuracy of 89.50 ± 13.45, sensitivity of 94.97 ± 13.49, specificity of 84.29 ± 22.27, and MCC of 0.80 ± 0.25 for 10-fold CV, and an accuracy of 88.00 ± 8.55, sensitivity of 87.89 ± 13.93, specificity of 87.51 ± 13.75, and MCC of 0.75 ± 0.18 for external testing. Taking advantage of the built-in feature selector of RF, a thorough analysis of feature importance was conducted. Newly identified fingerprint substructures, namely, three carbon-hetero bonds (i.e., secondary amide, tertiary amide, carboxyl derivative, carboxylic acid derivative and nitrile), carbon-carbon bonds (i.e., primary carbon, secondary carbon and alkene), aromatics (hetero N nonbasic) and carbon-hetero bond (alkyl aryl ether), may provide a better understanding of the structural variations governing the splice switching activity of the hemoglobin β-globin gene.
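The four performance measures named above can be illustrated with a minimal sketch that derives them from a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study; the formulas are the standard definitions of accuracy, sensitivity, specificity and MCC, with values returned as fractions rather than percentages.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: active compounds predicted active/inactive.
    tn/fp: inactive compounds predicted inactive/active.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of true actives recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true inactives recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC ranges from -1 to 1; by convention it is set to 0
    # when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for one test split (not data from the study).
metrics = classification_metrics(tp=7, tn=6, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a setting like the one described, such a function would be evaluated once per repetition (e.g., each of the 50 model constructions), and the mean and standard deviation of each measure would then be reported.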
