An improved dimensionality reduction method for meta-transcriptome indexing based diseases classification

Background: Bacterial 16S ribosomal RNA (rRNA) profiling has been widely used to classify microbiota-associated diseases. Dimensionality reduction is a key step in mining high-dimensional 16S rRNA expression data. High levels of sparsity and redundancy are common in 16S rRNA gene microbial surveys. Traditional feature selection methods are generally restricted to measuring correlated abundances, and their discriminative power is limited when very few microbes are shared across communities.

Results: Here we present a Feature Merging and Selection algorithm (FMS) for 16S rRNA expression data. By integrating Linear Discriminant Analysis (LDA), FMS reduces the feature dimension with higher accuracy while preserving the relationships between features. Two 16S rRNA expression datasets, one from pneumonia patients and one from dental caries patients, were used to test the validity of the algorithm. Combined with a support vector machine (SVM), FMS discriminated the classes in both the pneumonia and the dental caries datasets better than other popular feature selection methods.

Conclusions: FMS projects the data into a lower dimension while preserving enough features, and thus improves the intelligibility of the result. The results show that FMS is a more valid and reliable method for feature reduction.
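The abstract does not spell out the steps of FMS, so the following is not the FMS algorithm. It is only a minimal sketch of the univariate Fisher criterion (between-class scatter divided by within-class scatter), the per-feature form of the quantity that LDA maximizes, applied to a sparse abundance table. The function names `fisher_scores` and `top_k_features`, the toy abundance values, and the class labels are all hypothetical and chosen for illustration.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Return one Fisher score per feature (column) of `samples`.

    score = between-class scatter / within-class scatter.
    This is a generic criterion, not the FMS method itself.
    """
    classes = sorted(set(labels))
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        overall = mean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, y in zip(column, labels) if y == c]
            between += len(values) * (mean(values) - overall) ** 2
            within += len(values) * pvariance(values)
        if within == 0.0:
            # No within-class spread: infinitely separable if the class
            # means differ, uninformative (score 0) if the feature is constant.
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_k_features(samples, labels, k):
    """Return the indices of the k highest-scoring features."""
    scores = fisher_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Made-up abundance table: 6 samples (rows) x 4 taxa (columns).
# Column 3 is all zeros, mimicking a taxon absent from every sample.
samples = [
    [0.0, 5.0, 1.0, 0.0],
    [0.0, 6.0, 0.0, 0.0],
    [1.0, 5.5, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 1.5, 2.0, 0.0],
]
labels = ["case", "case", "case", "control", "control", "control"]

scores = fisher_scores(samples, labels)
selected = top_k_features(samples, labels, k=2)
print(scores)
print(selected)
```

In this toy table only the taxon in column 1 differs between the two groups, so it receives the highest score, while the all-zero column scores 0. A univariate score like this ignores relationships between features, which is the limitation the abstract says FMS is meant to address.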
