Efficient Mining Multi-Mers in a Variety of Biological Sequences

Counting the occurrence frequency of each $k$-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most $k$-mer counting algorithms take a single given $k$ and produce $k$-mers of that one length, which is inefficient when a sequence must be analyzed at several values of $k$. Moreover, existing $k$-mer counters focus more on DNA and RNA sequences than on protein sequences. In practice, the analysis of $k$-mers in protein sequences can provide substantial biological insight into structure, function, and evolution.
To this end, an efficient algorithm called MulMer (Multiple-Mer mining) is proposed to mine $k$-mers of various lengths, termed multi-mers, via an inverted-index technique, which is orders of magnitude faster than conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first algorithm able to mine multi-mers in a variety of sequences, including DNA, RNA, and protein sequences.
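
To make the inverted-index idea concrete, the following is a minimal sketch and not the MulMer implementation itself, whose details the abstract does not give. It assumes one plausible reading of the technique: each $k$-mer maps to a list of its start positions, and the lists for length $k+1$ are derived from those for length $k$, so all lengths are counted in one pass over growing position lists. The function name `mine_multi_mers` and its parameters are hypothetical.

```python
from collections import defaultdict


def mine_multi_mers(sequences, k_min, k_max):
    """Count every k-mer with k_min <= k <= k_max (illustrative sketch only).

    `sequences` is a list of strings over any alphabet (DNA, RNA, protein).
    Returns a dict mapping each k-mer to its total number of occurrences.
    """
    # Inverted index for k = 1: symbol -> list of (sequence id, start offset).
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            index[symbol].append((sid, pos))

    counts = {}
    for k in range(1, k_max + 1):
        if k >= k_min:
            # The count of a k-mer is the length of its position list.
            for mer, postings in index.items():
                counts[mer] = len(postings)
        if k == k_max:
            break
        # Grow each k-mer by one symbol, reusing its position list
        # instead of rescanning the sequences from scratch.
        longer = defaultdict(list)
        for mer, postings in index.items():
            for sid, pos in postings:
                end = pos + k
                if end < len(sequences[sid]):
                    longer[mer + sequences[sid][end]].append((sid, pos))
        index = longer
    return counts


if __name__ == "__main__":
    # Works unchanged on nucleotide and protein alphabets.
    print(mine_multi_mers(["ACGTAC"], 2, 3))
    print(mine_multi_mers(["MKMK"], 2, 3))
```

Because the index is keyed by whatever symbols appear in the input, the same sketch applies to DNA, RNA, and protein strings without an alphabet-specific encoding; a production counter would likely use a more compact representation of the position lists.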
