ToPS: A Framework to Manipulate Probabilistic Models of Sequence Data

Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model (GHMM); and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation, and decoding of the models. Additionally, it provides two criteria to help choose model parameters: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model probabilistic architectures using GHMMs. In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.
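To make the role of AIC and BIC concrete, the following is a minimal Python sketch of how the two criteria can score fixed-order Markov chains of increasing order on a single sequence. It is an illustration only and does not use ToPS or reflect its interface; the function names (`aic`, `bic`, `markov_log_likelihood`, `score_orders`) are hypothetical, and the parameter count assumes a fully parameterized chain over a fixed alphabet.

```python
import math
from collections import Counter


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_observations):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return num_params * math.log(num_observations) - 2 * log_likelihood


def markov_log_likelihood(sequence, order):
    """Maximized log-likelihood of a fixed-order Markov chain.

    The first `order` symbols are treated as given, and transition
    probabilities are the maximum-likelihood (relative frequency) estimates.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        context_counts[context] += 1
        transition_counts[(context, sequence[i])] += 1
    return sum(
        count * math.log(count / context_counts[context])
        for (context, _symbol), count in transition_counts.items()
    )


def score_orders(sequence, alphabet_size, max_order):
    """Return AIC and BIC for every order from 0 to max_order."""
    results = {}
    for order in range(max_order + 1):
        log_likelihood = markov_log_likelihood(sequence, order)
        # Assumption: one free distribution per context over the full alphabet.
        num_params = (alphabet_size ** order) * (alphabet_size - 1)
        num_observations = len(sequence) - order
        results[order] = {
            "aic": aic(log_likelihood, num_params),
            "bic": bic(log_likelihood, num_params, num_observations),
        }
    return results


if __name__ == "__main__":
    scores = score_orders("ACGT" * 50, alphabet_size=4, max_order=2)
    for order, values in scores.items():
        print(order, values)
```

The order with the lowest score is the one each criterion prefers. BIC penalizes extra parameters more heavily than AIC once the sequence has more than about eight observations, so it tends to select smaller models.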
