Novel Classification Method Development for Microbiome Data

Fan Gao, Master of Science, Department of Public Health Sciences, University of Toronto, 2019

Microbiome data are a popular research topic in recent medical science. However, microbiome data have distinctive features, such as high skewness, sparsity, and abundant outliers, which pose a major challenge for classification. Classification is a supervised learning task in which a classifier is built to predict the group membership of unknown observations. In this project, we propose a novel medoids-based classification method targeted at microbiome data. The method is constructed by locating the medoid of each group based on the L pairwise distance measure. We then compare our method with existing classification methods, namely logistic regression, LDA/MDA, CART, and KNN. The performance of these classification methods is evaluated with diagnostic test measures on simulation studies and on two real microbiome data sets. A sensitivity test is also conducted to assess the effect of outliers on the stability of each classification method.
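To make the idea concrete, the following is a minimal sketch of a medoids-based classifier, not the thesis implementation. It rests on several assumptions that the abstract does not confirm: the pairwise distance is taken to be the L1 (Manhattan) distance, a new observation is assigned to the group whose medoid is nearest, and the function names (`l1_pairwise`, `fit_medoids`, `predict`, `sensitivity_specificity`) and the toy data are hypothetical. The last function illustrates the kind of diagnostic measures (sensitivity and specificity) that could be used to compare classifiers.

```python
import numpy as np


def l1_pairwise(a, b):
    """L1 (Manhattan) distance between every row of a and every row of b.

    Assumption: the abstract does not specify the distance, L1 is used here.
    """
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)


def fit_medoids(X, y):
    """Return (labels, medoids): one medoid per group.

    The medoid of a group is the observed member with the smallest
    total distance to all members of the same group.
    """
    labels = np.unique(y)
    medoids = []
    for label in labels:
        group = X[y == label]
        total = l1_pairwise(group, group).sum(axis=1)
        medoids.append(group[np.argmin(total)])
    return labels, np.stack(medoids)


def predict(labels, medoids, X_new):
    """Assign each new observation to the group with the nearest medoid
    (an assumed decision rule)."""
    distances = l1_pairwise(X_new, medoids)
    return labels[np.argmin(distances, axis=1)]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity and specificity for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: sparse counts, with one outlier in group 0.
X = np.array([
    [0, 0, 1], [0, 1, 0], [1, 0, 0], [50, 0, 0],   # group 0
    [10, 10, 10], [11, 10, 9], [9, 11, 10],        # group 1
], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1])

labels, medoids = fit_medoids(X, y)
X_new = np.array([[0, 0, 0], [10, 9, 10]], dtype=float)
print(predict(labels, medoids, X_new))
```

Because a medoid is always an actual observation, the outlier `[50, 0, 0]` in the toy data does not drag the representative of group 0 toward it the way it would drag a mean-based centroid, which is the intuition behind using medoids for skewed, outlier-heavy data.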
