Application of data engineering approaches to address challenges in microbiome data for optimal medical decision-making

The human gut microbiota contributes to numerous physiological functions of the body and is also implicated in a myriad of pathological conditions. Prolific research over the past few decades has yielded valuable information on the relative taxonomic distribution of the gut microbiota. Unfortunately, microbiome data suffer from class imbalance and high dimensionality, both of which must be addressed before reliable models can be built. In this study, we implemented data engineering algorithms to address these issues. Four standard machine learning classifiers (logistic regression (LR), support vector machines (SVM), random forests (RF), and extreme gradient boosting (XGB) decision trees) were applied to a previously published dataset. Class imbalance was addressed with the synthetic minority oversampling technique (SMOTE), and high dimensionality with principal component analysis (PCA). Our results indicate that the ensemble classifiers (RF and XGB decision trees) exhibit superior classification accuracy in predicting the host phenotype. The application of PCA significantly reduced testing time while maintaining high classification accuracy. For most classifiers, the highest classification accuracy was obtained at the species level. The prototype employed in the study addresses the issues inherent to microbiome datasets and could be highly beneficial for providing personalized medicine.
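The two preprocessing steps named above can be illustrated with a minimal sketch. This is not the authors' implementation: the data are randomly generated stand-ins for a taxa abundance table, the helper names (`smote`, `pca_fit`, `pca_transform`) and all parameter values (`k=5`, 10 components, 90 versus 10 samples) are assumptions chosen for illustration, and only NumPy is used so the logic of each step stays visible.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE). Needs at least 2 minority samples."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                   # seed samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]   # chosen neighbour
    gap = rng.random((n_new, 1))                            # position on segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def pca_fit(X, n_components):
    """Return the feature means and the top principal axes (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the fitted principal axes."""
    return (X - mean) @ components.T

# Hypothetical toy data: 90 majority and 10 minority samples, 200 "taxa".
rng = np.random.default_rng(0)
X_major = rng.random((90, 200))
X_minor = rng.random((10, 200)) + 0.5

# Oversample the minority class until both classes have 90 samples.
X_synth = smote(X_minor, n_new=80, k=5, seed=1)
X_bal = np.vstack([X_major, X_minor, X_synth])
y_bal = np.array([0] * 90 + [1] * 90)

# Reduce 200 features to 10 principal components.
mean, comps = pca_fit(X_bal, n_components=10)
Z = pca_transform(X_bal, mean, comps)
print(Z.shape, np.bincount(y_bal))
```

In a real evaluation, the oversampling and the PCA fit would normally be restricted to the training split, with the held-out samples only projected through `pca_transform`, so that no information from the test data leaks into the model. The reduced matrix `Z` would then be passed to a classifier such as LR, SVM, RF, or XGB.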
