MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation

ABSTRACT Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k = 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses.

IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.

Author Video: An author video summary of this article is available.
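
To make the idea of comparing a sample against references by shared k-mers more concrete, the following is a minimal Python sketch. It is not the MetaPalette implementation: the function names (`kmer_counts`, `shared_fraction`, `palette`), the toy sequences, and the choice of "fraction of sample k-mer occurrences also found in a reference" as the summary statistic are all assumptions made for illustration. It uses tiny values of k so the toy data is readable, whereas the abstract refers to k = 30 and k = 50, and it ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_fraction(sample_counts, reference_counts):
    """Fraction of the sample's k-mer occurrences whose k-mer also occurs in the reference."""
    total = sum(sample_counts.values())
    if total == 0:
        return 0.0
    shared = sum(n for kmer, n in sample_counts.items() if kmer in reference_counts)
    return shared / total


def palette(reads, references, k_sizes):
    """Hypothetical 'palette': for each k and each reference, the shared k-mer fraction."""
    result = {}
    for k in k_sizes:
        # Pool k-mers from all reads, not only reads from marker genes.
        sample = Counter()
        for read in reads:
            sample.update(kmer_counts(read, k))
        result[k] = {
            name: shared_fraction(sample, kmer_counts(genome, k))
            for name, genome in references.items()
        }
    return result


# Toy data (invented for illustration only).
REFERENCES = {"ref_a": "ACGTACGTGGCA", "ref_b": "TTTTGGGGCCCC"}
READS = ["ACGTACGT", "GGGGCC"]

if __name__ == "__main__":
    for k, row in palette(READS, REFERENCES, k_sizes=(3, 5)).items():
        print(k, row)
```

In this sketch, a reference that shares many k-mers with the sample at the longer k is likely to be closely related to something in the sample, while a reference that shares k-mers only at the shorter k suggests a more distant relative. Turning such fractions into abundances and relatedness estimates is the modeling step described in the abstract and is not attempted here.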


[121]  Eriko Hoshino,et al.  Modelling marine community responses to climate‐driven species redistribution to guide monitoring and adaptive ecosystem‐based management , 2016, Global change biology.

[122]  Si Tang,et al.  Stability criteria for complex ecosystems , 2011, Nature.

[123]  S M Pincus,et al.  Approximate entropy as a measure of system complexity. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[124]  H. J. Jeffrey Chaos game representation of gene structure. , 1990, Nucleic acids research.

[125]  R. Whittaker Evolution and measurement of species diversity , 1972 .

[126]  Peter Beike,et al.  The Definition Of Standard Ml Revised , 2016 .

[127]  R. Knight,et al.  Quantitative and Qualitative β Diversity Measures Lead to Different Insights into Factors That Structure Microbial Communities , 2007, Applied and Environmental Microbiology.

[128]  Yu-Chieh Liao,et al.  Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes , 2016, Scientific Reports.

[129]  Andreas Holzinger,et al.  Selection of entropy-measure parameters for knowledge discovery in heart rate variability data , 2014, BMC Bioinformatics.

[130]  M. Novak,et al.  Complexity Increases Predictability in Allometrically Constrained Food Webs , 2016, The American Naturalist.

[131]  Zhiheng Pei,et al.  Pearls and pitfalls of genomics-based microbiome analysis , 2012, Emerging Microbes & Infections.

[132]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[133]  P. Wojtaszczyk,et al.  Stability and Instance Optimality for Gaussian Measurements in Compressed Sensing , 2010, Found. Comput. Math..

[134]  Paul Richardson,et al.  The DNA sequence and comparative analysis of human chromosome 5 , 2004, Nature.

[135]  Antonis Rokas,et al.  Inferring ancient divergences requires genes with strong phylogenetic signals , 2013, Nature.

[136]  Jonas S. Almeida,et al.  Rényi continuous entropy of DNA sequences. , 2004, Journal of theoretical biology.

[137]  Evan Bolton,et al.  Database resources of the National Center for Biotechnology Information , 2017, Nucleic Acids Res..

[138]  Roberto Hornero,et al.  Complex analysis of intracranial hypertension using approximate entropy* , 2006, Critical care medicine.

[139]  J. Richman,et al.  Physiological time-series analysis using approximate entropy and sample entropy. , 2000, American journal of physiology. Heart and circulatory physiology.

[140]  Michael J. Berry,et al.  Weak pairwise correlations imply strongly correlated network states in a neural population , 2005, Nature.

[141]  David Haussler,et al.  Sequence landscapes , 1986, Nucleic Acids Res..

[142]  Abel Torres,et al.  Interpretation of the approximate entropy using fixed tolerance values as a measure of amplitude variations in biomedical signals , 2010, 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology.

[143]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[144]  D. Huson,et al.  Analysis of 16S rRNA environmental sequences using MEGAN , 2011, BMC Genomics.

[145]  H. Philippe,et al.  Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. , 2013, Molecular biology and evolution.

[146]  Eric Vigoda,et al.  A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries , 2004, JACM.

[147]  John A. Nelder,et al.  A Simplex Method for Function Minimization , 1965, Comput. J..

[148]  Franco Blanchini,et al.  Computing the structural influence matrix for biological systems , 2016, Journal of mathematical biology.

[149]  W. Bialek,et al.  Statistical mechanics for natural flocks of birds , 2011, Proceedings of the National Academy of Sciences.

[150]  Trevor I. Dix,et al.  Sequence Complexity for Biological Sequence Analysis , 2000, Comput. Chem..

[151]  S. Jeffery Evolution of Protein Molecules , 1979 .

[152]  A. Maass,et al.  Topics in symbolic dynamics and applications , 2000 .

[153]  Annette Ostling,et al.  Sensitivity analysis of coexistence in ecological communities: theory and application. , 2014, Ecology letters.

[154]  Doug Hyatt,et al.  Enigmatic, ultrasmall, uncultivated Archaea , 2010, Proceedings of the National Academy of Sciences.

[155]  Gail L. Rosen,et al.  Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing , 2013, Bioinform..

[156]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[157]  R. Tibshirani,et al.  Least angle regression , 2004, math/0406456.

[158]  I. Daubechies,et al.  An iterative thresholding algorithm for linear inverse problems with a sparsity constraint , 2003, math/0307152.

[159]  Maya Gokhale,et al.  Scalable metagenomic taxonomy classification using a reference genome database , 2013, Bioinform..

[160]  R. Knight,et al.  Fast UniFrac: Facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data , 2009, The ISME Journal.

[161]  Enrique Blanco,et al.  Using geneid to Identify Genes , 2002, Current protocols in bioinformatics.

[162]  E.J. Candes,et al.  An Introduction To Compressive Sampling , 2008, IEEE Signal Processing Magazine.

[163]  S. Tringe,et al.  Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments , 2007, Science.

[164]  R. Knight,et al.  UniFrac: a New Phylogenetic Method for Comparing Microbial Communities , 2005, Applied and Environmental Microbiology.

[165]  A. Nekrutenko,et al.  Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences , 2010, Genome Biology.

[166]  R. Knight,et al.  Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers , 2008, Nucleic acids research.

[167]  Reuven Y. Rubinstein,et al.  Optimization of computer simulation models with rare events , 1997 .

[168]  S. Lonardi,et al.  CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers , 2015, BMC Genomics.

[169]  A. S.,et al.  Estimating the Entropy of DNA Sequences , 1997 .

[170]  Jean-Paul Delahaye,et al.  Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences , 1997, Comput. Appl. Biosci..

[171]  Serghei Mangul,et al.  Reference-free comparison of microbial communities via de Bruijn graphs , 2016, bioRxiv.

[172]  The Martin entrance boundary of the Galton–Watson process , 2006 .

[173]  Feng Gao,et al.  Comparison of various algorithms for recognizing short coding sequences of human genes , 2004, Bioinform..

[174]  C. Vogel Computational Methods for Inverse Problems , 1987 .

[175]  S. Frick,et al.  Compressed Sensing , 2014, Computer Vision, A Reference Guide.

[176]  Niklas Krumm,et al.  One Codex: A Sensitive and Accurate Data Platform for Genomic Microbial Identification , 2015, bioRxiv.

[177]  Maxime Crochemore,et al.  Zones of Low Entropy in Genomic Sequences , 1999, Comput. Chem..

[178]  A. Szczepaniak,et al.  Comparative genomics of Lupinus angustifolius gene-rich regions: BAC library exploration, genetic mapping and cytogenetics , 2013, BMC Genomics.

[179]  Robert G. Beiko,et al.  Rapid identification of high-confidence taxonomic assignments for metagenomic data , 2012, Nucleic acids research.

[180]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[181]  S. Salzberg,et al.  PhymmBL expanded: confidence scores, custom databases, parallelization and more , 2011, Nature Methods.

[182]  André Calero Valdez,et al.  On Graph Entropy Measures for Knowledge Discovery from Publication Network Data , 2013, CD-ARES.

[183]  Yunpeng Cai,et al.  ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time , 2011, Nucleic acids research.

[184]  Michael J. Berry,et al.  Ising models for networks of real neurons , 2006, q-bio/0611072.

[185]  Stephen Weeks,et al.  Whole-program compilation in MLton , 2006, ML '06.

[186]  Roy L. Adler,et al.  Topological entropy , 2008, Scholarpedia.

[187]  Piotr Indyk,et al.  Combining geometry and combinatorics: A unified approach to sparse signal recovery , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.

[188]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[189]  C. Huttenhower,et al.  Metagenomic microbial community profiling using unique clade-specific marker genes , 2012, Nature Methods.

[190]  N. Kashtan,et al.  Single-Cell Genomics Reveals Hundreds of Coexisting Subpopulations in Wild Prochlorococcus , 2014, Science.

[191]  Vladimir D. Gusev,et al.  On the complexity measures of genetic sequences , 1999, Bioinform..

[192]  H. Philippe,et al.  Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough , 2011, PLoS biology.

[193]  Gail L. Rosen,et al.  Metagenome Fragment Classification Using N-Mer Frequency Profiles , 2008, Adv. Bioinformatics.

[194]  Andreas Wilke,et al.  phylogenetic and functional analysis of metagenomes , 2022 .

[195]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[196]  Paul P. Gardner,et al.  An evaluation of the accuracy and speed of metagenome analysis tools , 2015, Scientific Reports.

[197]  Mihai Pop,et al.  MetaPhyler: Taxonomic profiling for metagenomic sequences , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[198]  A. Turing On Computable Numbers, with an Application to the Entscheidungsproblem. , 1937 .

[199]  Riaz A. Usmani,et al.  Inversion of a tridiagonal jacobi matrix , 1994 .

[200]  Lin Yuan,et al.  Minimum entropy and information measure , 1998, IEEE Trans. Syst. Man Cybern. Part C.

[201]  Paul Flicek,et al.  Gene prediction: compare and CONTRAST , 2007, Genome Biology.

[202]  Wing-Kin Sung,et al.  Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences , 2011, RECOMB.

[203]  Eoin L. Brodie,et al.  Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB , 2006, Applied and Environmental Microbiology.

[204]  J. Timothy Wootton,et al.  Characterizing Species Interactions to Understand Press Perturbations: What Is the Community Matrix? , 2016 .

[205]  Yuko Sakurai,et al.  Gut Dysbiosis and Detection of “Live Gut Bacteria” in Blood of Japanese Patients With Type 2 Diabetes , 2014, Diabetes Care.

[206]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[207]  A. Wilm,et al.  Species Identification and Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences , 2012, PloS one.

[208]  S. Carpenter,et al.  Anticipating Critical Transitions , 2012, Science.

[209]  J. Felsenstein CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAP , 1985, Evolution; international journal of organic evolution.

[210]  S. Carpenter,et al.  Stability and Diversity of Ecosystems , 2007, Science.

[211]  Ohad Shamir,et al.  High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions , 2013, Nucleic acids research.

[212]  R. Holt Predation, apparent competition, and the structure of prey communities. , 1977, Theoretical population biology.
