The subclass approach for mutational spectrum analysis: application of the SEM algorithm.

The analysis and comparison of mutational spectra is an important problem in molecular biology. To analyse a mutational spectrum, we apply an algorithm based on the SEM (Simulation, Expectation, Maximization) subclass approach. The algorithm classifies mutational sites according to their mutation probabilities, assigning each site to exactly one class. Each class is approximated by a binomial distribution, so any real mutational spectrum is regarded as a mixture of binomial distributions. The separation runs iteratively, and each iteration includes simulation, maximization and estimation procedures. The quality of the classification is evaluated with the χ² test. The algorithm has been checked on random spectra with preset parameters and on real mutational spectra. Of 19 real mutational spectra analysed, 17 can be divided into two or more classes of sites, one of which contains mutational hotspots. For the G:C→A:T mutational spectra induced by Sn1 alkylating mutagens (11 spectra), the classification accuracy was 0.95. To test different site volumes, each Sn1-induced spectrum was divided into G→A and C→T spectra; the classification accuracy for these spectra was 0.96. Analysis of the classification errors suggests that at least some of them cannot be ascribed to faults of the algorithm but are caused by specific features of the mutagenesis itself. The results for real data agree well with existing knowledge. The approach we present is an attempt to formalize the concept of a "mutational hotspot". The program implementing the SEM algorithm is available on the Web server (http://www.itba.mi.cnr.it/webmutation).
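
The iterative separation of a spectrum into binomial classes can be illustrated with a minimal sketch. It is not the program referred to above; it is a generic stochastic-EM loop for a binomial mixture, written under several assumptions: every site shares the same number of trials `n` (a hypothetical stand-in for, say, the number of mutants screened), the number of classes `k` is fixed in advance, and the names `sem_binomial_mixture`, `posteriors` and `binom_logpmf` are invented for this illustration. The χ² evaluation of the fit is not included.

```python
import math
import random


def binom_logpmf(x, n, p):
    """Log of the binomial probability of x mutations in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # clamp so that log() stays finite
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))


def posteriors(x, n, weights, probs):
    """Posterior probability of each class for a site with x mutations."""
    logs = [math.log(w) + binom_logpmf(x, n, p) for w, p in zip(weights, probs)]
    top = max(logs)  # subtract the maximum to avoid underflow
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sem_binomial_mixture(counts, n, k=2, iters=200, seed=0):
    """Assumed illustration: split sites into k binomial classes."""
    rng = random.Random(seed)
    # Starting guess: equal weights, probabilities spread over the observed range.
    lo, hi = min(counts) / n, max(counts) / n
    probs = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # Expectation and simulation: draw one class label per site
        # from its posterior distribution.
        labels = [rng.choices(range(k), weights=posteriors(x, n, weights, probs))[0]
                  for x in counts]
        # Maximization: re-estimate each class from the sites drawn into it.
        for j in range(k):
            members = [x for x, z in zip(counts, labels) if z == j]
            if not members:
                continue  # an empty class keeps its previous parameters
            weights[j] = len(members) / len(counts)
            probs[j] = sum(members) / (n * len(members))
        total = sum(weights)
        weights = [w / total for w in weights]
    # Final hard assignment: each site goes to its most probable class.
    final = []
    for x in counts:
        post = posteriors(x, n, weights, probs)
        final.append(post.index(max(post)))
    return weights, probs, final


# Made-up spectrum: 20 weakly mutable sites followed by 3 hotspot-like sites.
cold = [0, 1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0]
hot = [15, 18, 20]
weights, probs, classes = sem_binomial_mixture(cold + hot, n=100, k=2)
```

In this sketch the class with the largest estimated probability plays the role of the hotspot class. The counts are invented for illustration only and do not come from any of the spectra discussed here.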