Efficient Identification of Timed Automata: Theory and practice

This thesis studies system (or language) identification, a subfield at the intersection of artificial intelligence, learning theory, machine learning, and statistics. System identification is concerned with constructing (mathematical) models from observations. Such a model is an intuitive description of a complex system. One of the most attractive properties of models is that they can be visualized and inspected to provide insight into the different behaviors of a system. In addition, they can be used to perform calculations such as making predictions, analyzing properties, diagnosing errors, and performing simulations. Models are therefore extremely useful tools for understanding, interpreting, and modifying different kinds of systems. Unfortunately, constructing a model by hand can be very difficult.
This thesis investigates the difficulty of automatically identifying models from observations. We are given observations of some process and its environment, and these observations form sequences of events. Using system identification, we try to discover the logical structure underlying these event sequences. A well-known model of such a logical structure is the deterministic finite state automaton (DFA). Because a DFA is a language model, its identification (or inference) problem has been well studied in the field of grammatical inference. It is therefore natural to take an established method for learning a DFA and apply it to our event sequences.
However, when observing a system there is often more information than just the sequence of symbols (events): the time at which these symbols occur is also available. A DFA can be used to model this time information implicitly. A disadvantage of such an approach is that it can result in an exponential blowup of both the input data and the size of the resulting model. In this thesis, we propose a different method that uses the time information directly in order to produce a timed model.
We use a well-known DFA variant that includes the notion of time, called the timed automaton (TA). TAs are commonly used to model and reason about real-time systems. A TA models the time information explicitly, i.e., using numbers. Because numbers are written in a binary representation, such an explicit representation can result in exponentially more compact models than an implicit representation. Consequently, the time, space, and data required to identify TAs can also be exponentially smaller than the time, space, and data required to identify DFAs. This efficiency argument is the main reason we are interested in identifying TAs.
The work in this thesis makes four major contributions to the state of the art on this topic:
1. It contains a thorough theoretical study of the complexity of identifying TAs from data.
2. It provides an algorithm for identifying a simple TA from labeled data, i.e., from event sequences for which it is known to which type of system behavior they belong.
3. It extends this algorithm to the setting of unlabeled data, i.e., to event sequences with unknown behaviors.
4. It shows how to apply this algorithm to the problem of identifying a real-time monitoring system.
These contributions are of importance for anyone who is interested in identifying timed systems. Most importantly, both in our theoretical work and in our experiments we show that identifying a TA by using the time information directly is more efficient than identifying an equivalent DFA. In addition, because our techniques are general, they can be applied to many interesting problems. Examples are gaining insight into a real-time process, recognizing different process behaviors, identifying process models, and analyzing black-box systems.
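The size difference between the implicit and the explicit representation of time can be illustrated with a small sketch. It assumes one possible implicit encoding, chosen here purely for illustration: every elapsed time unit is written as a special tick symbol, so that an untimed DFA learner could process the result. The names `TICK`, `to_untimed`, and `explicit_size_bits` are hypothetical and not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical symbol standing for "one time unit has passed" (an assumption
# made for this illustration only).
TICK = "t"

# A timed string is a list of (event symbol, integer delay before the event).
TimedString = List[Tuple[str, int]]


def to_untimed(timed: TimedString) -> List[str]:
    """Implicit (unary) encoding: a delay of d becomes d tick symbols."""
    out: List[str] = []
    for symbol, delay in timed:
        out.extend([TICK] * delay)
        out.append(symbol)
    return out


def explicit_size_bits(timed: TimedString) -> int:
    """Explicit encoding: bits needed to write every delay in binary."""
    return sum(max(1, delay.bit_length()) for _, delay in timed)


timed_string: TimedString = [("a", 5), ("b", 1000), ("a", 3)]

# The unary encoding grows linearly with the delay values ...
print(len(to_untimed(timed_string)))      # 1011 symbols
# ... while the binary encoding grows only with their logarithm.
print(explicit_size_bits(timed_string))   # 15 bits for the three delays
```

Under this assumed encoding, the untimed input grows linearly in the delay values, whereas the explicit representation grows only logarithmically. This is the kind of exponential gap that the efficiency argument refers to.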