A Bayesian Nonparametric Model for Disease Subtyping: Application to Emphysema Phenotypes

We introduce a novel Bayesian nonparametric model that uses the concept of <italic>disease trajectories</italic> for disease subtype identification. Although our model is general, we demonstrate that by treating fractions of tissue patterns derived from medical images as compositional data, our model can be applied to study distinct progression trends between population subgroups. Specifically, we apply our algorithm to quantitative emphysema measurements obtained from chest CT scans in the COPDGene Study and identify several distinct progression patterns. As emphysema is one of the major components of chronic obstructive pulmonary disease (COPD), the third leading cause of death in the United States <xref ref-type="bibr" rid="ref1">[1]</xref>, an improved definition of emphysema and COPD subtypes is of great interest. We investigate several models with our algorithm and show that one with <inline-formula> <tex-math notation="LaTeX">$age$ </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">$pack~years$ </tex-math></inline-formula> (a measure of cigarette exposure), and <inline-formula> <tex-math notation="LaTeX">$smoking~status$ </tex-math></inline-formula> as predictors gives the best compromise between estimated predictive performance and model complexity. This model identified nine subtypes that showed significant associations with seven single nucleotide polymorphisms (SNPs) known to associate with COPD. Additionally, this model gives better predictive accuracy than multiple multivariate ordinary least squares regression, as demonstrated in a five-fold cross-validation analysis. We view our subtyping algorithm as a contribution that can bridge the gap between CT-level assessment of tissue composition and population-level analysis of compositional trends that vary between disease subtypes.
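
To make the phrase "compositional data" concrete: tissue-pattern fractions are positive and sum to one, so they are usually mapped to unconstrained coordinates before regression. The following is a minimal sketch using the centered log-ratio (CLR) transform, chosen here only for illustration; the abstract does not state which log-ratio transform the authors use. The function names and the example fractions are hypothetical.

```python
import math

def closure(parts):
    """Rescale positive parts so they sum to 1 (a valid composition)."""
    total = sum(parts)
    return [p / total for p in parts]

def clr(parts):
    """Centered log-ratio transform: log of each part minus the mean log.

    Assumes every part is strictly positive; zero fractions would need
    to be imputed first, which this sketch does not attempt.
    """
    logs = [math.log(p) for p in closure(parts)]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inverse(coords):
    """Map CLR coordinates back to a composition that sums to 1."""
    return closure([math.exp(c) for c in coords])

# Hypothetical fractions of three tissue patterns in one scan.
tissue_fractions = [0.70, 0.20, 0.10]
coords = clr(tissue_fractions)      # unconstrained, sums to zero
recovered = clr_inverse(coords)     # round-trips to the original fractions
```

CLR coordinates always sum to zero, so they occupy a subspace one dimension lower than the number of parts. Other log-ratio transforms, such as the isometric log-ratio, remove that redundancy.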

[1]  E. Arias,et al.  Deaths: Final Data for 2016. , 2018, National vital statistics reports : from the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System.

[2]  Edwin K Silverman,et al.  CT-Definable Subtypes of Chronic Obstructive Pulmonary Disease: A Statement of the Fleischner Society. , 2015, Radiology.

[3]  Suchi Saria,et al.  Clustering Longitudinal Clinical Marker Trajectories from Electronic Health Data: Applications to Phenotyping and Endotype Discovery , 2015, AAAI.

[4]  Xiang Wang,et al.  Unsupervised learning of disease progression models , 2014, KDD.

[5]  Jure Leskovec,et al.  Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , 2014, KDD 2014.

[6]  Stephanie A. Santorico,et al.  Cluster analysis in the COPDGene study identifies subtypes of smokers with distinct patterns of airway disease and emphysema , 2014, Thorax.

[7]  I. Kohane,et al.  Comorbidity Clusters in Autism Spectrum Disorders: An Electronic Health Record Time-Series Analysis , 2014, Pediatrics.

[8]  Raúl San José Estépar,et al.  Distinct quantitative computed tomography emphysema patterns are associated with physiology and function in smokers. , 2013, American journal of respiratory and critical care medicine.

[9]  Jennifer G. Dy,et al.  Nonparametric Mixture of Gaussian Processes with Constraints , 2013, ICML.

[10]  Raúl San José Estépar,et al.  Emphysema quantification in a multi-scanner HRCT cohort using local intensity distributions , 2012, 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI).

[11]  E. Regan,et al.  Genetic Epidemiology of COPD (COPDGene) Study Design , 2011, COPD.

[12]  John K Kruschke,et al.  Bayesian data analysis. , 2010, Wiley interdisciplinary reviews. Cognitive science.

[13]  Daniel S Nagin,et al.  Group-based trajectory modeling in clinical research. , 2010, Annual review of clinical psychology.

[14]  Sumio Watanabe,et al.  Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory , 2010, J. Mach. Learn. Res.

[15]  Marleen de Bruijne,et al.  Quantitative Analysis of Pulmonary Emphysema Using Local Binary Patterns , 2010, IEEE Transactions on Medical Imaging.

[16]  S. Gabriel,et al.  Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. , 2010, Cancer cell.

[17]  Arcot Sowmya,et al.  Multi-level classification of emphysema in HRCT lung images , 2009, Pattern Analysis and Applications.

[18]  Arcot Sowmya,et al.  Multi-level Classification of Emphysema in HRCT Lung Images Using Delegated Classifiers , 2008, MICCAI.

[19]  J. Seo,et al.  Texture-Based Quantification of Pulmonary Emphysema on High-Resolution Computed Tomography: Comparison With Density-Based Quantification and Correlation With Pulmonary Function Test , 2008, Investigative radiology.

[20]  G. Mateu-Figueras,et al.  The normal distribution in some constrained sample spaces , 2008, 0802.2643.

[21]  H. Muller,et al.  Lung Tissue Classification Using Wavelet Frames , 2007, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[22]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[23]  C. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[24]  Michael I. Jordan,et al.  Variational inference for Dirichlet process mixtures , 2006 .

[25]  V. Pawlowsky-Glahn,et al.  Groups of Parts and Their Balances in Compositional Data Analysis , 2005 .

[26]  T. Robbins,et al.  Heterogeneity of Parkinson’s disease in the early clinical stages using a data driven approach , 2005, Journal of Neurology, Neurosurgery & Psychiatry.

[27]  G. Mateu-Figueras,et al.  Isometric Logratio Transformations for Compositional Data Analysis , 2003 .

[28]  V. Pawlowsky-Glahn,et al.  Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation , 2003 .

[29]  B W Turnbull,et al.  Discovering subpopulation structure with latent class mixed models , 2002, Statistics in medicine.

[30]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[31]  E. Hoffman,et al.  Quantification of pulmonary emphysema from lung computed tomography images. , 1997, American journal of respiratory and critical care medicine.

[32]  M. Cosio,et al.  Centrilobular and panlobular emphysema in smokers. Two distinct morphologic and functional entities. , 1991, The American review of respiratory disease.

[33]  J. Sethuraman  A Constructive Definition of Dirichlet Priors , 1991 .

[34]  John Aitchison,et al.  The Statistical Analysis of Compositional Data , 1986 .

[35]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Ross P. Kindermann,et al.  Markov Random Fields and Their Applications , 1980 .

[37]  H. Akaike A new look at the statistical model identification , 1974 .

[38]  C. Antoniak Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems , 1974 .

[39]  T. Ferguson A Bayesian Analysis of Some Nonparametric Problems , 1973 .

[40]  Juan José Egozcue Rubí,et al.  The normal distribution in some constrained sample spaces , 2013 .

[41]  V. Pawlowsky-Glahn,et al.  Exploring Compositional Data with the CoDa-Dendrogram , 2011 .

[42]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res.

[43]  D. Postma,et al.  Chronic obstructive pulmonary disease. , 2002, Clinical evidence.

[44]  Tommi S. Jaakkola,et al.  Tutorial on variational approximation methods , 2000 .

[45]  A. Raftery Bayesian Model Selection in Social Research , 1995 .

[46]  J. Besag Spatial Interaction and the Statistical Analysis of Lattice Systems , 1974 .