Parameter estimation for robust HMM analysis of ChIP-chip data

Background: Tiling arrays are an important tool for the study of transcriptional activity, protein-DNA interactions and chromatin structure on a genome-wide scale at high resolution. Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. Especially in the context of ChIP-chip experiments, no standard procedures exist to obtain parameter estimates from the data. Common methods for the calculation of maximum likelihood estimates, such as the Baum-Welch algorithm or Viterbi training, are rarely applied in the context of tiling array analysis.

Results: Here we develop a hidden Markov model for the analysis of chromatin structure ChIP-chip tiling array data, using t emission distributions to increase robustness towards outliers. Maximum likelihood estimates are used for all model parameters. Two different approaches to parameter estimation are investigated and combined into an efficient procedure.

Conclusion: We illustrate an efficient parameter estimation procedure that can be used for HMM-based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.
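To make the modelling idea concrete, the following is a minimal sketch of Viterbi decoding for a two-state HMM with Student t emission distributions. It is not the authors' implementation: the function names (`t_logpdf`, `viterbi`), the two-state layout (0 = non-enriched, 1 = enriched) and all parameter values are assumptions chosen for illustration, and the parameters are fixed by hand rather than estimated by maximum likelihood as the abstract describes.

```python
import math

def t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student t distribution."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def viterbi(obs, log_init, log_trans, emis_params):
    """Most likely state path for an HMM with t emissions.

    emis_params[s] is a (mu, sigma, nu) tuple for state s.
    """
    n_states = len(log_init)
    # score[s]: best log probability of any path ending in state s
    score = [log_init[s] + t_logpdf(obs[0], *emis_params[s])
             for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + t_logpdf(x, *emis_params[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical parameters: state 0 = non-enriched, state 1 = enriched.
log_init = [math.log(0.9), math.log(0.1)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
emis_params = [(0.0, 1.0, 4.0), (2.0, 1.0, 4.0)]  # (mu, sigma, nu)

# Made-up probe statistics with one isolated outlier (6.0).
obs = [0.1, 6.0, -0.3, 0.2, -0.1, 2.1, 1.8, 2.4, 2.2, 0.0, -0.2]
path = viterbi(obs, log_init, log_trans, emis_params)
print(path)
```

Because the t density has heavy tails, a single extreme probe value contributes only a bounded amount of evidence for the enriched state, which is the robustness property the abstract refers to. Baum-Welch or Viterbi training would wrap a decoder like this in an iterative loop that re-estimates the parameters; that loop is not shown here.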
