Classifying synthetic and biological DNA sequences with side effect machines

Finite state machines are routinely used to recognize patterns in strings efficiently. The internal state structure of the machine is typically of only peripheral interest, appearing in algorithms only when the number of states is minimized for efficiency of execution or comparison. A side effect machine saves information about the internal transitions of the state machine. This record of internal state transitions forms an induced feature set for any string run through the side effect machine. In this study, the number of times a machine passes through each state is used as a numerical feature set for classification. Finite state machines are trained with an evolutionary algorithm to produce feature sets that are very easy for an unsupervised learning algorithm, k-means clustering, to learn. The system is demonstrated on synthetic and biological data. The biological data are PCR primers classified by their success at amplification. The parameters (number of states, population size, and mutation rates) are explored to characterize their effect on performance. Side effect machines are found to be effective at recognizing classes of DNA sequence data.
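
The state-visit feature set described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `sem_features`, the table-based encoding of the machine, the hand-written two-state `GC_MACHINE`, and the choice not to count the start state before any symbol is read. In the study the machines are evolved rather than written by hand.

```python
from typing import List, Sequence

ALPHABET = "ACGT"  # column order of the transition table (assumed encoding)


def sem_features(transitions: Sequence[Sequence[int]], seq: str, start: int = 0) -> List[int]:
    """Return, for each state, how many times it is entered while reading seq.

    transitions[s][i] is the next state when the machine is in state s and
    reads ALPHABET[i]. The start state is not counted until it is re-entered
    (an assumption of this sketch). Symbols outside ALPHABET raise ValueError.
    """
    counts = [0] * len(transitions)
    state = start
    for base in seq:
        state = transitions[state][ALPHABET.index(base)]
        counts[state] += 1
    return counts


# Hypothetical hand-built machine: enter state 1 on C or G, state 0 on A or T.
GC_MACHINE = [
    [0, 1, 1, 0],  # from state 0: A->0, C->1, G->1, T->0
    [0, 1, 1, 0],  # from state 1: A->0, C->1, G->1, T->0
]

features = sem_features(GC_MACHINE, "ACGGT")
print(features)  # [2, 3]: two A/T symbols, three C/G symbols
```

Each sequence maps to a vector with one entry per state, so a collection of sequences yields a numerical table that a clustering method such as k-means can consume directly. An evolutionary algorithm would then search over transition tables for machines whose vectors cluster well.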
