ABCModeller: an automatic data mining tool based on a consistent voting method with a user-friendly graphical interface

Extracting useful information from today's large volumes of biological data calls for simple, convenient tools for data analysis and modeling. This paper presents ABCModeller (Automatic Binary Classification Modeller), an automatic data mining tool with a user-friendly graphical interface that automates data preprocessing, significant feature extraction, classification modeling, model evaluation and prediction. To enhance the generalization ability of the final model, the tool implements a consistent voting method that combines three popular machine-learning algorithms: artificial neural network, support vector machine and random forest. Fibonacci search and orthogonal experimental design are also employed to automatically select significant features in the data space and optimal hyperparameters for the three algorithms, in order to achieve the best model. The reliability of the tool has been verified on multiple benchmark data sets. Because of its graphical interface, users without programming skills can obtain reliable models directly from original data, which reduces the complexity of modeling and data mining and can contribute to research in biology and other fields. The executable file of this tool can be downloaded from http://lishuyan.lzu.edu.cn/ABCModeller.rar.
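The abstract does not spell out the voting rule, so the following is only a minimal sketch of one plausible reading: a sample receives a class label only when all three models agree, and is otherwise left undecided. The function name `consistent_vote` and the three threshold rules standing in for the artificial neural network, support vector machine and random forest are hypothetical and are not taken from ABCModeller itself.

```python
from typing import Callable, Optional, Sequence

# A binary classifier here is any callable mapping a feature vector to 0 or 1.
Classifier = Callable[[Sequence[float]], int]


def consistent_vote(models: Sequence[Classifier],
                    sample: Sequence[float]) -> Optional[int]:
    """Return the shared label if every model agrees, otherwise None.

    Assumption: "consistent voting" is read as unanimous agreement;
    the tool's actual rule may differ.
    """
    votes = [model(sample) for model in models]
    first = votes[0]
    return first if all(v == first for v in votes) else None


# Hypothetical stand-ins for the three trained models (ANN, SVM, RF).
def ann_like(x: Sequence[float]) -> int:
    return int(x[0] + x[1] > 1.0)


def svm_like(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


def rf_like(x: Sequence[float]) -> int:
    return int(x[1] > 0.4)


MODELS = [ann_like, svm_like, rf_like]

if __name__ == "__main__":
    for sample in ([0.9, 0.8], [0.1, 0.2], [0.9, 0.2]):
        print(sample, consistent_vote(MODELS, sample))
```

Under this reading, samples on which the models disagree are flagged rather than forced into a class, which trades coverage for more dependable predictions on the samples that do receive a label.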
