Mining K-mers of Various Lengths in Biological Sequences

Counting the occurrence frequency of each k-mer in a biological sequence is an important step in many bioinformatics applications. However, most k-mer counting algorithms take a fixed k as input and produce k-mers of a single length, which is inefficient when a sequence has to be analyzed for several values of k. Moreover, existing k-mer counters focus mainly on DNA sequences and pay less attention to protein sequences, although in practice the analysis of k-mers in protein sequences can provide substantial biological insight into structure, function and evolution. To this end, an efficient algorithm called VLmer (Various Length k-mer mining) is proposed. It mines k-mers of various lengths, termed vl-mers, via an inverted-index technique, which is orders of magnitude faster than the conventional forward-index method. To the best of our knowledge, VLmer is the first algorithm able to mine k-mers of various lengths in both DNA and protein sequences.
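To make the idea of mining k-mers of several lengths from a position index more concrete, here is a minimal Python sketch. It is not the VLmer algorithm itself, whose details are not given in the abstract. It only illustrates one plausible reading of an inverted index: each k-mer maps to the list of its start positions, and the (k+1)-mers are derived from those positions instead of rescanning the sequence. The function name `mine_vl_mers` and its parameters `k_min` and `k_max` are hypothetical.

```python
from collections import defaultdict


def mine_vl_mers(seq, k_min, k_max):
    """Count every k-mer of seq with k_min <= k <= k_max.

    Illustrative sketch only (not the published VLmer method): an
    inverted index maps each k-mer to its start positions, and the
    index for k + 1 is built by extending those positions by one symbol.
    """
    if k_min < 1 or k_max < k_min:
        raise ValueError("require 1 <= k_min <= k_max")

    n = len(seq)

    # Inverted index for the shortest length: k-mer -> start positions.
    index = defaultdict(list)
    for i in range(n - k_min + 1):
        index[seq[i:i + k_min]].append(i)

    counts = {}
    k = k_min
    while index:
        # The count of a k-mer is the number of positions it occurs at.
        for kmer, positions in index.items():
            counts[kmer] = len(positions)
        if k == k_max:
            break

        # Extend each occurrence by the next symbol to get the (k+1)-mers.
        longer = defaultdict(list)
        for kmer, positions in index.items():
            for i in positions:
                end = i + k
                if end < n:  # occurrences at the sequence end cannot grow
                    longer[kmer + seq[end]].append(i)
        index = longer
        k += 1

    return counts


if __name__ == "__main__":
    # Works on any alphabet, e.g. a DNA string or a protein string.
    print(mine_vl_mers("ACGACG", 2, 3))
    print(mine_vl_mers("MKVMKV", 1, 2))
```

Because the sketch treats the sequence as a plain string, the same function handles DNA and protein alphabets. A practical tool would likely need a more compact position encoding and handling of multiple input sequences, which this sketch leaves out.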

Xiaoqing Yu | Jingsong Zhang | Xiangtian Yu | Luonan Chen | Tao Zeng | Jianmei Guo | Weifeng Guo