An Experiment in Learning the Language of Sequence Motifs: Sequence Logos vs. Finite-State Machines

Position weight matrices (PWMs) are the standard way to model binding site affinities in bioinformatics. However, they assume that symbol occurrences are independent across positions and therefore cannot capture the co-occurrence of symbols at different sequence positions. To address this problem, we propose constructing finite-state machines (FSMs) instead. Because FSMs otherwise grow too quickly as a function of the number of sequences to reveal any useful structure, we use a modified version of the Evidence-Driven State Merging (EDSM) heuristic to reduce the number of states. We tested our approach on sequence data for the transcription factor HNF4 and found that the constructed FSMs provide small representations and an intuitive visualization. Furthermore, the FSM was better than PWMs at discriminating between the positive and negative sequences in our data set.
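The contrast between the two models can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy binding sites, the function names, and the use of a plain prefix tree acceptor (the usual starting point for state-merging heuristics such as EDSM) are hypothetical and are not taken from the HNF4 data or from the modified EDSM procedure described above. No pseudocounts are used and no states are merged.

```python
from collections import Counter


def build_pwm(seqs, alphabet="ACGT"):
    """Column-wise relative frequencies (no pseudocounts) for equal-length sequences."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({a: counts[a] / len(seqs) for a in alphabet})
    return pwm


def pwm_prob(pwm, seq):
    """Probability of seq under the PWM: a product of independent column terms."""
    p = 1.0
    for column, symbol in zip(pwm, seq):
        p *= column[symbol]
    return p


def build_pta(seqs):
    """Prefix tree acceptor: one state per distinct prefix, state 0 is the root."""
    transitions = {}   # (state, symbol) -> next state
    accepting = set()
    next_state = 1
    for s in seqs:
        state = 0
        for symbol in s:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        accepting.add(state)
    return transitions, accepting


def pta_accepts(pta, seq):
    """Follow transitions from the root; reject on a missing edge."""
    transitions, accepting = pta
    state = 0
    for symbol in seq:
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    return state in accepting


# Hypothetical toy sites in which the two positions always agree.
sites = ["AA", "TT"]
pwm = build_pwm(sites)
pta = build_pta(sites)

print(pwm_prob(pwm, "AA"), pwm_prob(pwm, "AT"))        # PWM cannot tell these apart
print(pta_accepts(pta, "AA"), pta_accepts(pta, "AT"))  # the automaton can
```

In this toy case the PWM assigns the unseen sequence "AT" the same probability as the observed site "AA", because each column is scored independently, whereas the automaton keeps the dependency between the two positions. The price is one state per distinct prefix, which is why some form of state reduction is needed as the number of sequences grows.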
