Paper information - Efficient Mining Multi-Mers in a Variety of Biological Sequences

Efficient Mining Multi-Mers in a Variety of Biological Sequences

Counting the occurrence frequency of each k-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient when a sequence must be analyzed for several values of k. Moreover, existing k-mer counters focus more on DNA and RNA sequences and less on protein sequences. In practice, the analysis of k-mers in protein sequences can provide substantial biological insights into structure, function, and evolution.
To this end, an efficient algorithm called MulMer (Multiple-Mer mining) is proposed to mine k-mers of various lengths, termed multi-mers, via an inverted-index technique, which is orders of magnitude faster than conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first algorithm able to mine multi-mers in a variety of sequences, including DNA, RNA, and protein sequences.
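
The abstract does not describe MulMer's internals, so the following is only a minimal sketch of the general inverted-index idea, not the authors' implementation. It assumes that an inverted index maps each k-mer to the list of start positions where it occurs, and that (k+1)-mers can be derived from the k-mer position lists by appending the next symbol, so the sequence is scanned in full only once. The function name `mine_multi_mers` and its parameters `k_min` and `k_max` are hypothetical.

```python
from collections import defaultdict


def mine_multi_mers(seq, k_min, k_max):
    """Count every k-mer of seq for k_min <= k <= k_max.

    Illustrative sketch only: each k-mer maps to its start positions
    (an inverted index), and longer mers are grown from those lists.
    """
    if not 1 <= k_min <= k_max:
        raise ValueError("require 1 <= k_min <= k_max")

    # Inverted index for k = 1: symbol -> start positions.
    index = defaultdict(list)
    for pos, symbol in enumerate(seq):
        index[symbol].append(pos)

    counts = {}
    k = 1
    while True:
        if k >= k_min:
            # The count of a k-mer is the length of its position list.
            for mer, positions in index.items():
                counts[mer] = len(positions)
        if k == k_max:
            break

        # Grow each k-mer by one symbol using only its stored positions.
        longer = defaultdict(list)
        for mer, positions in index.items():
            for pos in positions:
                end = pos + k
                if end < len(seq):
                    longer[mer + seq[end]].append(pos)
        index = longer
        k += 1

    return counts


if __name__ == "__main__":
    # The alphabet is not fixed, so DNA, RNA, and protein strings all work.
    print(mine_multi_mers("ACGACG", 2, 3))
    print(mine_multi_mers("MKKM", 1, 2))
```

Because the sketch treats the sequence as a plain string, it makes no assumption about the alphabet. Its running time and memory use have not been measured and say nothing about the speedups reported for MulMer.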
