Optimized Ensemble Machine Learning Framework for High Dimensional Imbalanced Bio Assays

Received: 22 July 2019. Accepted: 30 September 2019.

In pharmaceutical research, a recent hotspot is the study of the activity of bioactive compounds and drugs using computational intelligence. Such studies often adopt machine learning techniques to speed up modelling, and rely on bioassays to evaluate the effect and potency of a compound or drug. This paper aims to design an efficient and accurate method for assessing the activity of bioactive compounds and drugs. First, the authors performed virtual screening on data for bioactive compounds and drugs, addressing the class imbalance and the high dimensionality of the drug descriptors. Next, eight machine learning algorithms, namely Bayes Net, Naive Bayes, SMO, J48, Random Forest, AdaBoost, AdaBag and logistic regression, were trained on the virtually screened data and used to predict whether a drug is active or inactive in a bioassay. The synthetic minority oversampling technique (SMOTE) was employed to handle the many imbalanced bioassay datasets. On this basis, the random forest ensemble model was optimized. Experimental results show that the optimized random forest framework achieved better results than the other ensemble-based machine learning methods. The research provides an effective way to handle high-dimensional imbalanced bioassay data.
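The oversampling step mentioned above can be illustrated with a minimal sketch of SMOTE-style interpolation. This is not the authors' implementation: the function name `smote`, its parameters, and the toy descriptor vectors are assumptions made for illustration, and a real study would more likely use an established library. Each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    `minority` is a list of equal-length numeric feature vectors.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy descriptor vectors (made-up numbers, not bioassay data)
inactive = [[0.1 * i, 0.2 * i] for i in range(20)]          # majority class
active = [[5.0, 5.0], [5.5, 5.2], [6.0, 5.8], [5.2, 6.1]]   # minority class

# Oversample the minority class until both classes have the same size
new_active = smote(active, n_new=len(inactive) - len(active), k=3)
balanced_active = active + new_active
print(len(inactive), len(balanced_active))
```

In a pipeline like the one the abstract describes, oversampling of this kind would normally be applied to the training split only, so that synthetic samples do not leak into the evaluation data.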
