Application of Two-Part Statistics for Comparison of Sequence Variant Counts

Investigation of microbial communities, particularly human-associated communities, is significantly enhanced by the vast amounts of sequence data produced by high-throughput sequencing technologies. However, these technologies produce high-dimensional, complex data sets that consist of a large proportion of zeros, non-negative skewed counts, and, frequently, a limited number of samples. These features distinguish sequence data from other forms of high-dimensional data and are not adequately addressed by the statistical approaches in common use. Ultimately, medical studies may identify targeted interventions or treatments, but the lack of analytic tools for feature selection and for identifying the taxa responsible for differences between groups is hindering progress. The objective of this paper is to examine the application of a two-part statistic to identify taxa that differ between two groups. The advantages of the two-part statistic over common statistical tests applied to sequence count data sets are discussed. Results from the t-test, the Wilcoxon test, and the two-part test are compared using sequence counts from microbial ecology studies in cystic fibrosis and from cenote samples. We show that the two-part statistic outperforms the t-test and the Wilcoxon test for analysis of sequence data. The improved performance in microbial ecology studies was independent of study type and of the sequencing technology used.
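The abstract does not spell out the two-part statistic, so the following is only a minimal sketch of one common formulation, assumed here rather than taken from the paper: a two-sample Z statistic for the proportion of nonzero counts is combined with a Wilcoxon rank-sum Z statistic computed on the nonzero counts only, and the sum of their squares is referred to a chi-square distribution with 2 degrees of freedom. The function names (`two_part_test`, `_proportion_z`, `_rank_sum_z`) and the rule of dropping a part, and one degree of freedom, when that part is undefined are illustrative assumptions.

```python
import math


def _proportion_z(n1, m1, n2, m2):
    """Z statistic comparing the proportions of nonzero observations (m/n).

    Returns None when the pooled variance is zero (all zero or all nonzero).
    """
    pooled = (m1 + m2) / (n1 + n2)
    var = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if var == 0:
        return None
    return (m1 / n1 - m2 / n2) / math.sqrt(var)


def _rank_sum_z(x, y):
    """Wilcoxon rank-sum Z via the normal approximation, with tie correction.

    Returns None when either sample is empty or every value is tied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return None
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        t = j - i                           # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return None
    return (rank_sum_x - mean) / math.sqrt(var)


def two_part_test(group1, group2):
    """Return (statistic, degrees of freedom, p-value) for two count samples."""
    nz1 = [v for v in group1 if v > 0]
    nz2 = [v for v in group2 if v > 0]
    z_b = _proportion_z(len(group1), len(nz1), len(group2), len(nz2))
    z_w = _rank_sum_z(nz1, nz2)
    parts = [z for z in (z_b, z_w) if z is not None]
    df = len(parts)                         # assumption: an undefined part drops a df
    x2 = sum(z * z for z in parts)
    if df == 0:
        return 0.0, 0, 1.0
    if df == 1:
        p = math.erfc(math.sqrt(x2 / 2))    # chi-square survival function, 1 df
    else:
        p = math.exp(-x2 / 2)               # chi-square survival function, 2 df
    return x2, df, p


# Hypothetical counts of one taxon in two groups of eight samples each.
stat, dof, p_value = two_part_test([0, 0, 0, 0, 0, 0, 1, 2],
                                   [0, 0, 3, 4, 5, 6, 7, 8])
```

In this made-up example both parts contribute: the groups differ in how often the taxon is present and in its abundance when present, which is the situation where a combined statistic can detect a difference that a test on either aspect alone might miss. The normal approximation for the rank-sum part is rough with so few nonzero values; a permutation-based p-value would be a reasonable alternative for small samples.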
