Mining K-mers of Various Lengths in Biological Sequences

Counting the occurrence frequency of each k-mer in a biological sequence is an important step in many bioinformatics applications. However, most k-mer counting algorithms take a single given k and produce k-mers of that one length, which is inefficient when an analysis must cover several values of k. Moreover, existing k-mer counters focus mainly on DNA sequences and much less on protein sequences, even though the analysis of k-mers in protein sequences can provide substantial biological insight into structure, function and evolution. To address both gaps, an efficient algorithm called VLmer (Various Length k-mer mining) is proposed. It mines k-mers of various lengths, termed vl-mers, using an inverted-index technique, which is orders of magnitude faster than the conventional forward-index method. To the best of our knowledge, VLmer is the first algorithm able to mine k-mers of various lengths in both DNA and protein sequences.
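The abstract does not give VLmer's internals, so the following is only a minimal sketch of the general idea of mining k-mers of several lengths from an inverted index (a map from each k-mer to its start positions), not the authors' algorithm. The function name `mine_vl_mers` and the parameters `k_min`, `k_max` and `min_count` are hypothetical names chosen for illustration. Each (k+1)-mer's position list is derived from the position list of its k-length prefix, so the sequence is never rescanned for each k, and the same code works for DNA and protein alphabets.

```python
from collections import defaultdict


def mine_vl_mers(seq, k_min=1, k_max=3, min_count=1):
    """Count every k-mer with k_min <= k <= k_max via position lists.

    Illustrative sketch only; all names and parameters are assumptions.
    """
    # Inverted index for 1-mers: symbol -> list of start positions.
    index = defaultdict(list)
    for pos, symbol in enumerate(seq):
        index[symbol].append(pos)
    # A longer k-mer can never occur more often than its prefix,
    # so infrequent entries can be dropped before extending.
    index = {m: p for m, p in index.items() if len(p) >= min_count}

    counts = {}
    k = 1
    while index and k <= k_max:
        if k >= k_min:
            counts.update({m: len(p) for m, p in index.items()})
        # Extend each k-mer by the symbol that follows each occurrence.
        longer = defaultdict(list)
        for kmer, positions in index.items():
            for start in positions:
                end = start + k
                if end < len(seq):
                    longer[kmer + seq[end]].append(start)
        index = {m: p for m, p in longer.items() if len(p) >= min_count}
        k += 1
    return counts


if __name__ == "__main__":
    # DNA and protein inputs go through the same code path.
    print(mine_vl_mers("ACGACG", k_min=2, k_max=3))
    print(mine_vl_mers("MKVMKV", k_min=2, k_max=3, min_count=2))
```

The `min_count` filter is an optional design choice in this sketch: because counts can only shrink as k-mers grow, pruning at length k safely removes all of that k-mer's extensions as well.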
