A machine learning framework for trait based genomics

Microbial communities perform many important ecological functions across a wide range of natural and man-made environments. Trait-based approaches have recently been recognized as useful for studying these communities, and the increasing availability of whole-genome sequences provides an opportunity to explore the genetic foundations of a variety of functional traits. In this paper, we propose a machine learning framework that quantitatively links genotype to functional traits. Genes from bacterial genomes belonging to different functional trait groups are grouped into Clusters of Orthologous Groups (COGs), which serve as features. The TF-IDF (term frequency-inverse document frequency) technique from text mining is then applied to transform the data so that it reflects both the abundance and the importance of each COG. After TF-IDF processing, the COGs are ranked using feature selection methods according to their relevance to the functional trait of interest. We focus on a binary functional trait in this paper, but plan to extend our approach to continuous functional traits in the future. Experimental results demonstrate that genes related to a functional trait can be detected using our method.
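The pipeline described above (COG counts per genome, TF-IDF weighting, then ranking against a binary trait) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `tf_idf` and `rank_cogs`, the toy COG identifiers and counts, the smoothed IDF formula, and especially the ranking score (absolute difference of class means), which is a simple stand-in and not necessarily one of the feature selection methods used in the paper.

```python
import math

def tf_idf(counts):
    """Weight a genomes-by-COGs count matrix with TF-IDF.

    Each genome plays the role of a document and each COG the role of a term.
    The smoothed IDF used here is an assumption, not the paper's exact formula.
    """
    n_genomes = len(counts)
    n_cogs = len(counts[0])
    # Document frequency: number of genomes in which each COG occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_cogs)]
    # Smoothed IDF keeps the weight finite and positive for every COG.
    idf = [math.log((1 + n_genomes) / (1 + df[j])) + 1.0 for j in range(n_cogs)]
    weighted = []
    for row in counts:
        total = sum(row)
        # Term frequency: share of the genome's counted genes in this COG.
        weighted.append(
            [(row[j] / total if total else 0.0) * idf[j] for j in range(n_cogs)]
        )
    return weighted

def rank_cogs(weighted, labels, cog_ids):
    """Rank COGs by |mean(trait present) - mean(trait absent)|.

    This score is a hypothetical stand-in for a feature selection method.
    """
    pos = [row for row, y in zip(weighted, labels) if y == 1]
    neg = [row for row, y in zip(weighted, labels) if y == 0]
    scores = {}
    for j, cog in enumerate(cog_ids):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores[cog] = abs(mean_pos - mean_neg)
    # Highest score first: most strongly associated with the trait.
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): rows are genomes, columns are COG gene counts.
cog_ids = ["COG_A", "COG_B", "COG_C"]
counts = [
    [3, 1, 0],  # genome with the trait
    [2, 1, 0],  # genome with the trait
    [0, 1, 2],  # genome without the trait
    [0, 1, 3],  # genome without the trait
]
labels = [1, 1, 0, 0]  # binary functional trait: 1 = present, 0 = absent

weighted = tf_idf(counts)
ranking = rank_cogs(weighted, labels, cog_ids)
print(ranking)
```

In this toy example, the COG that occurs in every genome at a similar relative abundance ends up last in the ranking, while the COGs confined to one trait group rank first. In practice the score would be replaced by whichever feature selection method is being evaluated.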
