Motif Recognition Using LVQ Classifiers with Overlap-based Similarity Metrics

Identifying the locations and specificities of DNA-protein binding sites (also termed motifs) is an important step towards understanding the mechanisms of gene expression. To save experimental cost and time, computational approaches have received increasing interest and have shown good potential for solving this problem. Given a set of known motif instances associated with a transcription factor, motif recognition becomes a biological data classification problem in which the datasets are highly imbalanced. This paper deals with the problem of single motif recognition using machine learning techniques. We first develop an overlap-based similarity metric (OSIM) to compare DNA sub-sequences. As an application of the metric, we then propose a motif recognition system that uses Learning Vector Quantization 1 (LVQ1) as its primary classifier. In the system, we replace the Euclidean norm of LVQ1 with OSIM and introduce corresponding modifications to the winning-prototype update and the classification process. The system is also integrated with a new sampling technique to handle the imbalance of biological datasets. Finally, we examine the recognition capability of our approach in comparison with P-Match and three well-known learner models, namely Neural Networks (NN), Support Vector Machines (SVM), and LVQ1. Experimental results show that, with the support of OSIM and the sampling method, the learner models can produce high recall rates but relatively low precision rates on the tested datasets.
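To make the idea of swapping the distance measure in LVQ1 concrete, the following is a minimal sketch, not the system described in this paper. It assumes one-hot encoded DNA sub-sequences and uses a hypothetical `placeholder_similarity` (mean per-position overlap) in place of OSIM, whose definition is not given here. The update rule is the textbook LVQ1 rule, with the winner chosen by maximum similarity rather than minimum Euclidean distance. All function names, the toy sequences, and the clipping of prototype entries to [0, 1] are assumptions made for illustration.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA sub-sequence as a list of 4-element 0/1 rows."""
    return [[1.0 if base == b else 0.0 for b in ALPHABET] for base in seq]

def placeholder_similarity(x, proto):
    """Stand-in for OSIM (assumption, NOT the paper's metric):
    mean per-position overlap between a sequence and a prototype."""
    total = sum(xi * pi for xr, pr in zip(x, proto) for xi, pi in zip(xr, pr))
    return total / len(x)

def winner(x, protos, similarity):
    """Index of the most similar prototype (first index wins ties)."""
    return max(range(len(protos)), key=lambda j: similarity(x, protos[j]))

def train_lvq1(seqs, labels, init_seqs, proto_labels, similarity,
               lr=0.1, epochs=20):
    """Textbook LVQ1: pull the winner toward same-class samples,
    push it away from other-class samples."""
    protos = [one_hot(s) for s in init_seqs]
    for _ in range(epochs):
        for seq, y in zip(seqs, labels):
            x = one_hot(seq)
            w = winner(x, protos, similarity)
            sign = 1.0 if proto_labels[w] == y else -1.0
            # Clip to [0, 1] so entries stay interpretable as base weights
            # (an illustrative choice, not taken from the paper).
            protos[w] = [
                [min(1.0, max(0.0, p + sign * lr * (xi - p)))
                 for xi, p in zip(xr, pr)]
                for xr, pr in zip(x, protos[w])
            ]
    return protos

def classify(seq, protos, proto_labels, similarity):
    """Assign the label of the most similar prototype."""
    return proto_labels[winner(one_hot(seq), protos, similarity)]

# Toy data (invented for illustration): label 1 = motif, 0 = background.
train_seqs = ["TATAAT", "TATAAA", "TATACT", "GCGCGC", "CCGGCG", "GGCCGC"]
train_labels = [1, 1, 1, 0, 0, 0]
proto_labels = [1, 0]
protos = train_lvq1(train_seqs, train_labels, ["TATAAT", "GCGCGC"],
                    proto_labels, placeholder_similarity)
prediction = classify("TATATT", protos, proto_labels, placeholder_similarity)
```

Because the similarity function is passed in as an argument, a different metric can be substituted without touching the training loop. Any modification to the prototype update that a specific metric requires would replace the update expression inside `train_lvq1`.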