Fast Search of Sequences with Complex Symbol Correlations using Profile Context-Sensitive HMMs and Pre-Screening Filters

Profile context-sensitive HMMs (profile-csHMMs) were recently proposed and are very effective at modeling the common patterns and motifs in related symbol sequences. Profile-csHMMs can represent long-range correlations between distant symbols, even when these correlations are entangled in a complicated manner. This makes profile-csHMMs a useful tool in computational biology, especially for modeling noncoding RNAs (ncRNAs) and finding new ncRNA genes. However, a profile-csHMM-based search is quite slow and therefore impractical for searching a large database. In this paper, we propose a practical scheme that makes the search significantly faster without any degradation in prediction accuracy. The proposed method uses a pre-screening filter based on a profile-HMM, which filters out most sequences that would not be predicted as a match by the original profile-csHMM. Experimental results show that the proposed approach can make the search eighty times faster.
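
The two-stage idea can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function `prescreened_search`, the two thresholds, and the toy scoring functions `toy_cheap_score` and `toy_expensive_score` are all hypothetical stand-ins, with the cheap scorer playing the role of the profile-HMM filter and the expensive scorer playing the role of the profile-csHMM. The sketch only shows the control flow: the slow model is run solely on sequences that pass the fast filter.

```python
from typing import Callable, Iterable, List, Tuple


def prescreened_search(
    sequences: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    filter_threshold: float,
    match_threshold: float,
) -> Tuple[List[Tuple[str, float]], int]:
    """Run the expensive scorer only on sequences that pass the cheap filter.

    Returns the list of (sequence, expensive score) hits and the number of
    sequences that actually reached the expensive scorer.
    """
    hits: List[Tuple[str, float]] = []
    n_expensive = 0
    for seq in sequences:
        # Stage 1: fast pre-screening (stand-in for a profile-HMM).
        if cheap_score(seq) < filter_threshold:
            continue
        # Stage 2: slow, more expressive model (stand-in for a profile-csHMM).
        n_expensive += 1
        score = expensive_score(seq)
        if score >= match_threshold:
            hits.append((seq, score))
    return hits, n_expensive


def toy_cheap_score(seq: str) -> float:
    """Hypothetical cheap score: number of G or C symbols (no correlations)."""
    return float(sum(1 for ch in seq if ch in "GC"))


def toy_expensive_score(seq: str) -> float:
    """Hypothetical expensive score: count of complementary symbol pairs
    between position i and its mirror position, a crude long-range correlation."""
    complement = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    return float(sum(1 for i in range(n // 2) if (seq[i], seq[n - 1 - i]) in complement))


if __name__ == "__main__":
    database = ["GCGCGC", "AAAAAA", "GCATGC", "GGGGGG"]
    hits, n_expensive = prescreened_search(
        database, toy_cheap_score, toy_expensive_score,
        filter_threshold=4, match_threshold=3,
    )
    print(hits, n_expensive)
```

For prediction accuracy to be preserved, the filter threshold must be loose enough that every sequence the expensive model would accept also passes the filter; the speedup then comes from the fraction of the database that the filter rejects.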