DNA splice site detection: a comparison of specific and general methods

In an era when whole-organism genomes are routinely sequenced, the problem of gene finding has become a key issue on the road to understanding. For eukaryotic organisms, a large part of locating genes is accomplished by predicting the likely locations of splice sites on a DNA strand. This problem of splice site location has been approached using a number of machine learning and statistical methods tailored, more or less specifically, to the nature of the problem. Recently, large margin classifiers and boosting methods have been found to give improvements over more traditional methods in a number of areas. Here we compare large margin classifiers (SVM and CMLS) and boosted decision trees with the three most common models used for splice site detection (WMM, WAM, and MDT). We find that the newer methods compare favorably in all cases and can yield significant improvement in some cases.