Efficient statistical significance approximation for local similarity analysis of high-throughput time series data

MOTIVATION Local similarity analysis of biological time series data helps elucidate the varying dynamics of biological systems. However, its application to large-scale, high-throughput data is limited by the slow permutation procedures used to evaluate statistical significance.

RESULTS We developed a theoretical approach to approximate the statistical significance of local similarity analysis, based on the approximate tail distribution of the maximum partial sum of independent identically distributed (i.i.d.) random variables. Simulations show that the derived formula approximates the tail distribution reasonably well (for series with more than 10 time points without delay and more than 20 time points with delay) and provides P-values comparable with those from permutations. The new approach enables efficient calculation of statistical significance for pairwise local similarity analysis, making possible all-to-all local association studies that would otherwise be computationally prohibitive. As a demonstration, local similarity analysis of human microbiome time series shows that core operational taxonomic units (OTUs) are highly synergetic and that some of the associations are body-site specific across samples.

AVAILABILITY The new approach is implemented in our eLSA package, which now provides pipelines for faster local similarity analysis of time series data. The tool is freely available from eLSA's website: http://meta.usc.edu/softs/lsa.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

CONTACT fsun@usc.edu.
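To make the computational bottleneck concrete, the following is a minimal sketch of a local similarity score and a permutation-based P-value of the kind the abstract describes as slow. It is not the eLSA implementation: the function names (`zscore`, `local_similarity`, `permutation_pvalue`), the use of plain z-score normalization, and the add-one P-value estimate are all assumptions made for illustration. The score is taken here as the largest absolute partial sum of products of the two normalized series over contiguous segments, allowing a shift of at most `max_delay` time points, divided by the series length.

```python
import random


def zscore(series):
    """Center and scale a series (population standard deviation).
    Assumption: the real pipeline may use a different normalization."""
    n = len(series)
    mean = sum(series) / n
    sd = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [(v - mean) / sd for v in series]


def local_similarity(x, y, max_delay=0):
    """Largest absolute local partial sum of x[i] * y[j] with
    |i - j| <= max_delay, divided by the series length."""
    n = len(x)
    best = 0.0
    for shift in range(-max_delay, max_delay + 1):
        pos = neg = 0.0  # running positive and negative partial sums
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                prod = x[i] * y[j]
                pos = max(0.0, pos + prod)  # reset when the sum drops below 0
                neg = max(0.0, neg - prod)
                best = max(best, pos, neg)
    return best / n


def permutation_pvalue(x, y, max_delay=0, n_perm=200, seed=0):
    """Estimate a P-value by shuffling x and recomputing the score.
    Cost grows linearly with n_perm, which is what makes
    all-to-all comparisons expensive."""
    rng = random.Random(seed)
    observed = local_similarity(x, y, max_delay)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if local_similarity(shuffled, y, max_delay) >= observed:
            hits += 1
    # Add-one estimate (an assumption) so the P-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)
```

In this sketch each pair of series costs `n_perm` score evaluations, so an all-to-all analysis of many series multiplies that cost by the number of pairs. A closed-form tail approximation replaces the inner loop with a single formula evaluation per pair, which is the efficiency gain the abstract refers to.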
