Modelling discrete longitudinal data using acyclic probabilistic finite automata

Acyclic probabilistic finite automata (APFA) constitute a rich family of models for discrete longitudinal data. An APFA may be represented as a directed multigraph, and embodies a set of context-specific conditional independence relations that may be read off the graph. A model selection algorithm to minimize a penalized likelihood criterion such as AIC or BIC is described. This algorithm is compared to one implemented in Beagle, a widely used program for processing genomic data, in terms of both the rate of convergence to the true model as the sample size increases and a goodness-of-fit measure assessed using cross-validation. The comparisons are based on three data sets, two from molecular genetics and one from social science. The proposed algorithm performs at least as well as the algorithm in Beagle in both respects.

We introduce APFA as graphical models for discrete longitudinal data.
We propose a novel model selection algorithm based on penalized likelihood.
We compare its rate of convergence and goodness-of-fit to Beagle.
We use data from molecular genetics and social science in the comparisons.
Our algorithm performs at least as well as the algorithm in Beagle.
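As a rough illustration of the penalized likelihood criterion mentioned in the abstract, the sketch below scores a toy automaton with AIC and BIC. It is not the authors' algorithm: it only evaluates the criterion for one fixed graph, and it assumes that each node carries a multinomial distribution over its outgoing edges, so that the number of free parameters is the sum over nodes of (number of outgoing edges minus one). The function names, the dictionary layout and the toy counts are hypothetical.

```python
import math

def log_likelihood(edge_counts):
    """Maximized log-likelihood of an automaton given edge counts.

    edge_counts maps each non-sink node to a dict {symbol: count}, where
    count is the number of observed sequences traversing that outgoing
    edge. Assumes (for this sketch) that every listed count is positive.
    """
    ll = 0.0
    for counts in edge_counts.values():
        total = sum(counts.values())
        for c in counts.values():
            ll += c * math.log(c / total)  # MLE edge probability is c / total
    return ll

def n_free_parameters(edge_counts):
    """Sum over nodes of (number of outgoing edges - 1)."""
    return sum(len(counts) - 1 for counts in edge_counts.values())

def penalized_criterion(edge_counts, n_sequences, kind="BIC"):
    """Return -2 log L + penalty * k, with penalty 2 (AIC) or log n (BIC)."""
    if kind == "AIC":
        penalty = 2.0
    elif kind == "BIC":
        penalty = math.log(n_sequences)
    else:
        raise ValueError("kind must be 'AIC' or 'BIC'")
    k = n_free_parameters(edge_counts)
    return -2.0 * log_likelihood(edge_counts) + penalty * k

# Hypothetical toy automaton fitted to 10 sequences of length 2.
toy_counts = {
    "root": {"a": 6, "b": 4},
    "after_a": {"a": 3, "b": 3},
    "after_b": {"a": 4},  # a single outgoing edge contributes no parameter
}
aic = penalized_criterion(toy_counts, n_sequences=10, kind="AIC")
bic = penalized_criterion(toy_counts, n_sequences=10, kind="BIC")
print(f"AIC = {aic:.3f}, BIC = {bic:.3f}")
```

A selection procedure of the kind described would compare such scores across candidate graphs and keep the one with the smallest value; how candidates are generated is outside the scope of this sketch.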
