Mapping genomic features to functional traits through microbial whole genome sequences

The utility of trait-based approaches for studying microbial communities has recently been recognized. The increasing availability of whole genome sequences provides an opportunity to explore the genetic foundations of a variety of functional traits. We propose a machine learning framework that quantitatively links genomic features to functional traits. Genes from bacterial genomes exhibiting different functional traits are grouped into clusters of orthologous groups (COGs), which serve as features. The TF-IDF technique from the text mining domain is then applied to transform the data so that it reflects both the abundance and the importance of each COG. After TF-IDF processing, the COGs are ranked with feature selection methods according to their relevance to the functional trait of interest. Extensive experiments demonstrate that genes related to a functional trait can be detected with our method. The method also has the potential to provide novel biological insights.
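The pipeline described above (COG counts per genome, TF-IDF weighting, then ranking COGs against a trait) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the COG identifiers, the toy genomes, and the trait labels are hypothetical, the TF-IDF variant (relative frequency times log(N/df)) is one common form, and the ranking score (absolute difference in mean TF-IDF weight between trait-positive and trait-negative genomes) is a simple stand-in for the feature selection methods, which the abstract does not specify.

```python
import math


def tfidf(counts):
    """Weight COG counts by TF-IDF.

    counts: list of dicts {cog_id: copy_number}, one dict per genome.
    TF is the COG's share of all gene copies in the genome; IDF is
    log(N / number of genomes containing the COG). Both choices are
    assumptions, not necessarily the variant used in the paper.
    """
    n_genomes = len(counts)
    doc_freq = {}
    for genome in counts:
        for cog, copies in genome.items():
            if copies > 0:
                doc_freq[cog] = doc_freq.get(cog, 0) + 1

    weighted = []
    for genome in counts:
        total = sum(genome.values())
        row = {}
        for cog, copies in genome.items():
            if copies > 0 and total > 0:
                tf = copies / total
                idf = math.log(n_genomes / doc_freq[cog])
                row[cog] = tf * idf
        weighted.append(row)
    return weighted


def rank_cogs(weighted, labels):
    """Rank COGs by |mean weight in trait-positive - mean in trait-negative|.

    This score is an illustrative stand-in for a real feature selection method.
    """
    pos = [w for w, y in zip(weighted, labels) if y]
    neg = [w for w, y in zip(weighted, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one genome in each trait class")

    def mean(rows, cog):
        return sum(r.get(cog, 0.0) for r in rows) / len(rows)

    cogs = set()
    for row in weighted:
        cogs.update(row)
    scores = {cog: abs(mean(pos, cog) - mean(neg, cog)) for cog in cogs}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: four genomes, the first two carry the trait.
genomes = [
    {"COG_A": 2, "COG_B": 1, "COG_C": 1},
    {"COG_A": 1, "COG_B": 1, "COG_C": 2},
    {"COG_B": 1, "COG_C": 3},
    {"COG_B": 2, "COG_C": 2},
]
labels = [True, True, False, False]

weighted = tfidf(genomes)
ranking = rank_cogs(weighted, labels)
for cog, score in ranking:
    print(cog, round(score, 4))
```

In this toy input, COGs present in every genome receive an IDF of zero and therefore drop out of the ranking, while a COG found only in trait-positive genomes rises to the top. That is the intuition behind borrowing TF-IDF from text mining: ubiquitous housekeeping genes behave like stop words.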
