A fast sorting algorithm for aptamer identification using deep sequencing

In recent years, with the advent of fast sequencing technology, genomic databases have been growing rapidly. Researchers in bioinformatics need faster and more accurate tools to analyze these very large data sets. In aptamer search, the goal is to find the over-represented DNA sequences in randomly generated aptamer libraries. Hash functions are widely used in substring comparison, sequence alignment, and clustering tools. We have developed a lightweight tool that uses hash functions to reduce the size of the genomic data and conducts η-neighbor searches around a centroid sequence. This greatly improves search efficiency compared with existing tools. Furthermore, computing the hash values of the η-neighbors in advance decreases the search overhead. On a dataset of 2.23 million sequences, the proposed algorithm accurately counts the frequency of the Human α-Thrombin aptamer sequences in less than 40 seconds, whereas the current script-based method takes 2 hours and 18 minutes.
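
The idea of hashing the η-neighbors in advance can be illustrated with a minimal Python sketch. It is not the tool's implementation, and it rests on assumptions that the abstract does not state: an η-neighbor is taken to be any sequence within Hamming distance η of the centroid, the hash is a simple 2-bit-per-base integer encoding, and the names `encode`, `eta_neighbors`, and `count_neighbors` are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base (assumed hash)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]  # raises KeyError on N, etc.
    return value


def eta_neighbors(centroid, eta):
    """All sequences within Hamming distance eta of centroid (assumed definition)."""
    neighbors = {centroid}
    for d in range(1, eta + 1):
        for positions in combinations(range(len(centroid)), d):
            # At each chosen position, try every base other than the original.
            choices = [[b for b in ALPHABET if b != centroid[p]] for p in positions]
            for subs in product(*choices):
                seq = list(centroid)
                for p, b in zip(positions, subs):
                    seq[p] = b
                neighbors.add("".join(seq))
    return neighbors


def count_neighbors(reads, centroid, eta):
    """Count reads that are eta-neighbors of centroid via precomputed hashes."""
    # Hash every neighbor once, before scanning the reads.
    neighbor_hashes = {encode(s) for s in eta_neighbors(centroid, eta)}
    counts = Counter()
    for read in reads:
        # The encoding is only collision-free among reads of equal length.
        if len(read) != len(centroid):
            continue
        try:
            key = encode(read)
        except KeyError:
            continue  # skip reads with ambiguous bases
        if key in neighbor_hashes:
            counts[read] += 1
    return counts


# Toy data, not taken from the paper.
reads = ["ACGT", "ACGA", "TTTT", "ACGT", "AGGT"]
counts = count_neighbors(reads, "ACGT", eta=1)
total = sum(counts.values())
```

With this layout, each read costs one encoding and one set lookup, regardless of how many neighbors the centroid has; the neighbor set itself grows quickly with η, so the sketch is only practical for small η.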