Entropy-scaling search of massive biological data

Many data sets exhibit well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here we introduce a framework for similarity search based on characterizing a data set's entropy and fractal dimension. We prove that search time scales with metric entropy (the number of covering hyperspheres) if the fractal dimension of the data set is low, and that space scales with the sum of metric entropy and information-theoretic entropy (the randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup over DIAMOND and 3,700x over BLASTX), and protein structure search (esFragBag, 10x speedup over FragBag). Our framework can be used to achieve 'compressive omics,' and the general theory can be readily applied to data science problems outside of biology. Source code: http://gems.csail.mit.edu.
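To illustrate how search over covering hyperspheres can work, here is a minimal Python sketch of a coarse-then-fine range search. It is not the implementation of Ammolite, MICA, or esFragBag. The function names (`build_cover`, `search`), the greedy covering strategy, and the use of Euclidean distance are all assumptions made for this example. The sketch assumes a true metric, so the triangle inequality guarantees that no hit is missed. The coarse step costs one distance computation per covering sphere, which is why the number of spheres governs search time when few spheres lie near the query.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def build_cover(points, cluster_radius, dist=euclidean):
    """Greedily cover the data with spheres of radius cluster_radius.

    Returns a list of (center, members) pairs. Every point is within
    cluster_radius of the center of the cluster it is assigned to.
    (Hypothetical helper; a simple greedy pass, not an optimal cover.)
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            # No existing sphere contains p, so p becomes a new center.
            clusters.append((p, [p]))
    return clusters


def search(query, clusters, radius, cluster_radius, dist=euclidean):
    """Return all points within radius of query.

    Coarse step: keep only clusters whose center is within
    radius + cluster_radius of the query. By the triangle inequality,
    any point within radius of the query belongs to such a cluster.
    Fine step: scan the members of the surviving clusters.
    """
    hits = []
    for center, members in clusters:
        if dist(query, center) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits
```

As a usage example, `clusters = build_cover(points, 1.0)` followed by `search(q, clusters, 0.5, 1.0)` returns the same points as a brute-force scan of `points`, while skipping every cluster whose center is farther than 1.5 from `q`.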
