ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are employed in diverse life-science research domains. When applying such algorithms, researchers face the challenge of deciding which algorithm(s) to apply in a given research domain. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize these choices based on empirical evidence rather than hearsay or anecdotal experience. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages, and dependency conflicts may arise during installation. To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface that helps users construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner.
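To make the idea of nested cross-validation concrete, the following is a minimal, standard-library-only sketch. It is not ShinyLearner's implementation or interface: the toy dataset, the small k-nearest-neighbors classifier, and every function name (`make_folds`, `knn_predict`, `accuracy`, `nested_cv`, `make_toy_data`) are assumptions introduced purely for illustration. The inner loop chooses a hyperparameter using only the outer training data, the outer loop estimates accuracy on data the selection step never saw, and a per-fold log records each nested decision. Feature selection could be placed in the same inner loop.

```python
import random
from collections import Counter

def make_folds(n, k, seed):
    """Shuffle positions 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x, k):
    """Return the majority label among the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(X, y, train_idx, test_idx, k):
    """Fit on train_idx (k-NN simply stores the data) and score on test_idx."""
    train_X = [X[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)

def nested_cv(X, y, candidate_ks, outer_k=5, inner_k=3, seed=0):
    """Nested cross-validation; returns one log record per outer fold."""
    log = []
    outer_folds = make_folds(len(X), outer_k, seed)
    for fold_no, test_idx in enumerate(outer_folds):
        train_idx = [i for f in outer_folds if f is not test_idx for i in f]
        # Inner loop: split ONLY the outer training data to pick a hyperparameter.
        inner_folds = make_folds(len(train_idx), inner_k, seed + fold_no + 1)
        inner_scores = {}
        for k in candidate_ks:
            scores = []
            for val_pos in inner_folds:
                held_out = set(val_pos)
                val_idx = [train_idx[p] for p in val_pos]
                fit_idx = [train_idx[p] for p in range(len(train_idx))
                           if p not in held_out]
                scores.append(accuracy(X, y, fit_idx, val_idx, k))
            inner_scores[k] = sum(scores) / len(scores)
        best_k = max(candidate_ks, key=lambda k: inner_scores[k])
        # Outer evaluation: the test fold played no part in choosing best_k.
        log.append({
            "outer_fold": fold_no,
            "inner_scores": inner_scores,
            "best_k": best_k,
            "outer_accuracy": accuracy(X, y, train_idx, test_idx, best_k),
        })
    return log

def make_toy_data(n_per_class=30, seed=1):
    """Two Gaussian clusters in 2D (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y

X, y = make_toy_data()
candidate_ks = [1, 3, 5]
log = nested_cv(X, y, candidate_ks)
for record in log:
    print(record)
```

Keeping the per-fold log is what makes the procedure transparent: a reader can see which hyperparameter each outer fold selected and on what inner evidence, rather than only a single aggregate score.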
