Constrained mixture estimation for analysis and robust classification of clinical time series

Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, analyzing clinical gene expression data poses multiple challenges: most of the well-known theoretical issues apply, such as high-dimensional feature spaces with few examples, noise and missing data. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we focus on classifying the response to interferon-β (IFNβ) treatment in multiple sclerosis (MS) patients, a problem that has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard treatment. For these patients, treatment should be ceased in a timely manner to mitigate side effects.

Results: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFNβ treatment. The advantages of our approach are that it takes the temporal nature of the data into account and that it is robust to noise, missing data and mislabeled samples. Moreover, mixture estimation makes it possible to explore the presence of response sub-groups of patients at the transcriptional level. Our approach clearly outperformed all prior approaches in terms of prediction accuracy, raising it above 90% for the first time. Additionally, we were able to identify potentially mislabeled samples and to sub-divide the good responders into two sub-groups that exhibited different transcriptional response programs. This sub-division is supported by recent findings on MS pathology and may therefore raise interesting clinical follow-up questions.

Availability: The method is implemented in the GQL framework and is available at http://www.ghmm.org/gql. Datasets are available at http://www.cin.ufpe.br/~igcf/MSConst

Contact: igcf@cin.ufpe.br

Supplementary information: Supplementary data are available at Bioinformatics online.
