Applications of Hidden Markov Models in Microarray Gene Expression Data

Hidden Markov models (HMMs) are well-developed statistical models for capturing hidden information from observable sequential symbols. They were first used in speech recognition in the 1970s and have been successfully applied to the analysis of biological sequences since the late 1980s, for example in finding protein secondary structure, CpG islands, and families of related DNA or protein sequences [1]. In an HMM, the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. In this chapter, we describe two applications that use HMMs to predict gene functions in yeast and DNA copy number alterations in human tumor cells, both based on gene expression microarray data. The first application employs HMMs as a gene function prediction tool to infer gene function in the budding yeast Saccharomyces cerevisiae from time-series microarray gene expression data. The sequential observations in the HMM are the discretized expression measurements of each gene at each time point of the time-series microarray experiments. Yeast is an excellent model organism: it has a reasonably simple genome structure, well-characterized gene functions, and large expression data sets. A wide variety of data mining methods have been applied to infer yeast gene functions from gene expression data sets, such as Decision Trees, Artificial Neural Networks, Support Vector Machines (SVMs), and K-Nearest Neighbors (KNNs) [2-4]. However, those methods achieved only about 40% precision in predicting the functions of un-annotated genes [2-4]. Based on our observations, there are three main reasons for this low prediction performance. First, the computational models are too simple to address the systematic variations of biological systems. One common assumption is that genes from the same function class show a similar expression pattern. However, clustering results have shown that functions and clusters have a many-to-many relationship, and it is often difficult to assign a function to an expression pattern (Eisen et al., supplementary data) [5]. Second, expression measurements are generally not very accurate and contain experimental error (noise), so the observed expression values may not reflect the real expression levels of genes. For example, a correlation as low as 60% was reported between measurements of the same sample hybridized to two slides [6]. Third, none of the above methods explicitly addresses the less obvious but significant correlation among gene expression values. Our results indicate that the expression value of a gene depends significantly on its previous expression value. Therefore, the Markov property can be assumed to simplify the non-independence of gene
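The idea of treating discretized expression measurements as the observation sequence of an HMM can be illustrated with a minimal sketch. The discretization thresholds, the two function classes, and all model parameters below are hypothetical values chosen for illustration only; they are not taken from the chapter, and scoring a gene against one HMM per function class is an assumed design, not necessarily the procedure the authors used. The likelihood is computed with the standard scaled forward algorithm.

```python
import math

def discretize(values, low=-0.5, high=0.5):
    """Map expression values to symbols: 0 = down, 1 = unchanged, 2 = up.
    The thresholds are hypothetical, not taken from the chapter."""
    symbols = []
    for v in values:
        if v < low:
            symbols.append(0)
        elif v > high:
            symbols.append(2)
        else:
            symbols.append(1)
    return symbols

def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | model) for a discrete HMM.
    start[s]: initial probability of state s
    trans[r][s]: probability of moving from state r to state s
    emit[s][k]: probability that state s emits symbol k"""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Markov step: each state depends only on the previous state.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

# Two made-up class models (assumption: one HMM per function class).
MODELS = {
    "class_A": dict(start=[0.5, 0.5],
                    trans=[[0.9, 0.1], [0.1, 0.9]],
                    emit=[[0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]),
    "class_B": dict(start=[0.5, 0.5],
                    trans=[[0.9, 0.1], [0.1, 0.9]],
                    emit=[[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]]),
}

# Hypothetical time-series profile of one gene.
profile = [0.9, 1.1, 0.7, 0.2, -0.1, 0.8]
symbols = discretize(profile)
scores = {name: forward_log_likelihood(symbols, **m) for name, m in MODELS.items()}
predicted = max(scores, key=scores.get)
print(symbols, predicted)
```

In this sketch the gene is assigned to the class whose model gives the observed symbol sequence the highest likelihood. In practice the model parameters would be estimated from annotated genes (for example with the Baum-Welch algorithm) rather than set by hand.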

[1]  Dmitrij Frishman,et al.  MIPS: a database for genomes and protein sequences , 2000, Nucleic Acids Res..

[2]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Jason Weston,et al.  Gene functional classification from heterogeneous data , 2001, RECOMB.

[4]  Christian A. Rees,et al.  Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Bert Vogelstein,et al.  Genetic instability and darwinian selection in tumours , 1999 .

[6]  G. Churchill Fundamentals of experimental design for cDNA microarrays , 2002, Nature Genetics.

[7]  Elena Marchiori,et al.  Breakpoint identification and smoothing of array comparative genomic hybridization data , 2004, Bioinform..

[8]  M. Ringnér,et al.  Impact of DNA amplification on gene expression patterns in breast cancer. , 2002, Cancer research.

[9]  Ajay N. Jain,et al.  Hidden Markov models approach to the analysis of array CGH data , 2004 .

[10]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[11]  Thomas Cremer,et al.  Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization , 1993, Human Genetics.

[12]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[13]  Anne Kallioniemi,et al.  Targets of gene amplification and overexpression at 17q in gastric cancer. , 2002, Cancer research.

[14]  M. Caligiuri,et al.  Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[15]  L. Staudt,et al.  Diffuse large B-cell lymphoma subgroups have distinct genetic profiles that influence tumor biology and improve gene-expression-based survival prediction. , 2005, Blood.

[16]  W. Linehan,et al.  The consequences of chromosomal aneuploidy on gene expression profiles in a cell line model for prostate carcinogenesis. , 2001, Cancer research.

[17]  L. Staudt,et al.  Specific secondary genetic alterations in mantle cell lymphoma provide prognostic information independent of the gene expression-based proliferation signature. , 2007, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[18]  L. Staudt,et al.  Molecular subtypes of diffuse large B-cell lymphoma arise by distinct genetic pathways , 2008, Proceedings of the National Academy of Sciences.

[19]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[20]  Simon Tavaré,et al.  BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data , 2006, Bioinform..

[21]  David Botstein,et al.  Gene expression patterns and gene copy number changes in dermatofibrosarcoma protuberans. , 2003, The American journal of pathology.

[22]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[23]  George Karypis,et al.  Gene Classification Using Expression Profiles: A Feasibility Study , 2005, Int. J. Artif. Intell. Tools.

[24]  Karen Dybkær,et al.  Genomic Analyses Reveal Global Functional Alterations That Promote Tumor Growth and Novel Tumor Suppressor Genes in Natural Killer-Cell Malignancies , 2008 .

[25]  D. Pinkel,et al.  Comparative Genomic Hybridization for Molecular Cytogenetic Analysis of Solid Tumors , 2022 .