CCLasso: correlation inference for compositional data through Lasso

MOTIVATION: Advances in high-throughput sequencing for 16S rRNA gene profiling have made direct analysis of microbial communities in the environment and the human body more convenient and reliable. Inferring correlations among members of a microbial community is of fundamental importance in genomic survey studies. Traditional Pearson correlation analysis, which treats the observed data as absolute abundances of the microbes, may produce spurious results because the data represent only relative abundances. Such compositional data require special care and appropriate methods before correlation analysis.

RESULTS: In this article, we first discuss how correlation is defined for the latent variables underlying compositional data. We then propose a novel method, CCLasso, based on least squares with an $\ell_1$ penalty, to infer the correlation network among these latent variables from metagenomic data. The optimization problem is solved with an effective alternating direction algorithm derived from the augmented Lagrangian method. Simulation results show that CCLasso outperforms existing methods, e.g. SparCC, in edge recovery for compositional data. It also compares well with SparCC in estimating the correlation network of microbial species from the Human Microbiome Project.

AVAILABILITY AND IMPLEMENTATION: CCLasso is open source and freely available from https://github.com/huayingfang/CCLasso under GNU LGPL v3.

CONTACT: dengmh@pku.edu.cn

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
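The spurious-correlation problem described under MOTIVATION can be illustrated with a small simulation. The sketch below is not part of CCLasso and does not reproduce its algorithm. It assumes three taxa whose absolute abundances are drawn independently from a log-normal distribution (an illustrative choice), and the helper names `closure` and `off_diagonal` are hypothetical. It compares Pearson correlations before and after the abundances are normalized to proportions.

```python
import numpy as np


def closure(abundances):
    """Normalize each sample (row) so its abundances sum to 1."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum(axis=1, keepdims=True)


def off_diagonal(matrix):
    """Return the off-diagonal entries of a square matrix as a 1-D array."""
    mask = ~np.eye(matrix.shape[0], dtype=bool)
    return matrix[mask]


# Assumption for illustration: 3 independent taxa, 2000 samples.
rng = np.random.default_rng(0)
n_samples, n_taxa = 2000, 3
absolute = rng.lognormal(mean=0.0, sigma=0.5, size=(n_samples, n_taxa))

# What a sequencing experiment effectively reports: relative abundances.
relative = closure(absolute)

# Pearson correlation between taxa (columns) in each representation.
abs_corr = np.corrcoef(absolute, rowvar=False)
rel_corr = np.corrcoef(relative, rowvar=False)

print("absolute abundances:\n", np.round(abs_corr, 2))
print("relative abundances:\n", np.round(rel_corr, 2))
```

Although the simulated taxa are independent, the unit-sum constraint forces the proportions to be negatively correlated (close to -0.5 for three exchangeable components), which is the kind of artifact that methods such as SparCC and CCLasso are designed to avoid.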
