mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines

Summary

Machine learning (ML) for classification and prediction based on a set of features is used to make decisions in healthcare, economics, criminal justice, and more. However, implementing an ML pipeline, including preprocessing, model selection, and evaluation, can be time-consuming, confusing, and difficult. Here, we present mikropml (pronounced “meek-ROPE em el”), an easy-to-use R package that implements ML pipelines using regression, support vector machines, decision trees, random forest, or gradient-boosted trees. The package is available on GitHub, CRAN, and conda.
