DCMD: Distance-based classification using mixture distributions on microbiome data

Recent advances in next-generation sequencing have enabled comprehensive research on the microbiome and human disease, and recent studies have identified associations between the human microbiome and health outcomes for a number of chronic conditions. However, the structure of microbiome data, characterized by sparsity and skewness, makes it difficult to build effective classifiers. To address this, we present an approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance on microbiome community data, where the predictors are sparse and heterogeneous counts. It models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on the observed counts and the estimated mixture. These distributions are then used as inputs for distance-based classification. The method is implemented within k-means and k-nearest neighbours frameworks, and we identify two distance metrics that produce the best results. The performance of the model is assessed using simulations and an application to a human microbiome study, with results compared against a number of existing machine learning and distance-based approaches. The proposed method is competitive with the machine learning approaches and shows a clear improvement over commonly used distance-based classifiers. Its range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data.
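
The pipeline described above (estimate a mixture, represent each count as a distribution conditional on that mixture, then classify by distance) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: Poisson components with fixed, hand-picked rates, a Hellinger distance between the per-count posteriors, and a plain majority-vote k-nearest neighbours rule. The abstract does not name the mixture family or the two distance metrics, and all function names here are hypothetical.

```python
import math
from collections import Counter

def poisson_pmf(x, lam):
    # Poisson probability of count x at rate lam, computed in log space.
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def posterior(x, weights, rates):
    # Distribution over mixture components, conditional on the observed count x.
    joint = [w * poisson_pmf(x, lam) for w, lam in zip(weights, rates)]
    total = sum(joint)
    return [p / total for p in joint]

def fit_weights(counts, rates, n_iter=50):
    # EM updates for the mixing weights only; the rates are held fixed (assumption).
    k = len(rates)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for x in counts:
            for j, p in enumerate(posterior(x, weights, rates)):
                totals[j] += p
        weights = [t / len(counts) for t in totals]
    return weights

def hellinger(p, q):
    # Hellinger distance between two discrete distributions (assumed metric).
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def represent(sample, weights, rates):
    # Replace each raw count in a sample by its posterior over components.
    return [posterior(x, weights, rates) for x in sample]

def sample_distance(rep_a, rep_b):
    # Sum the per-feature distances between two represented samples.
    return sum(hellinger(p, q) for p, q in zip(rep_a, rep_b))

def knn_predict(query, train, labels, weights, rates, k=3):
    # Majority vote among the k training samples closest to the query.
    q_rep = represent(query, weights, rates)
    dists = [(sample_distance(q_rep, represent(s, weights, rates)), y)
             for s, y in zip(train, labels)]
    dists.sort(key=lambda pair: pair[0])
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

# Toy data: two taxa, two classes with opposite abundance patterns.
train = [[0, 28], [1, 33], [0, 30], [31, 0], [27, 1], [35, 0]]
labels = ["A", "A", "A", "B", "B", "B"]
rates = [0.5, 5.0, 30.0]  # hypothetical component rates
weights = fit_weights([x for s in train for x in s], rates)
print(knn_predict([0, 29], train, labels, weights, rates))
```

Working with posteriors rather than raw counts means that a count of 0 and a count of 1 are treated as nearly the same observation when the mixture assigns both to the same low-abundance component, which is one way to read the abstract's point about uncertainty in sparse counts.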
