Methodological variations in lagged regression for detecting physiologic drug effects in EHR data

We studied how lagged linear regression can be used to detect the physiologic effects of drugs from data in the electronic health record (EHR). We systematically examined how six methodological variations affect the performance of lagged linear methods in this context: (i) time series construction, (ii) temporal parameterization, (iii) intra-subject normalization, (iv) differencing (lagged rates of change obtained by taking differences between consecutive measurements), (v) explanatory variables, and (vi) regression models. We generated two gold standards (one knowledge-base derived, one expert-curated) for expected pairwise relationships between 7 drugs and 4 labs, and evaluated how well the 64 unique combinations of methodological perturbations reproduce the gold standards. Our 28 cohorts included patients in the Columbia University Medical Center/NewYork-Presbyterian Hospital clinical database and ranged from 2,820 to 79,514 patients, with an average of between 8 and 209 time points per patient. The most accurate methods achieved an AUROC of 0.794 for the knowledge-base derived gold standard (95% CI [0.741, 0.847]) and 0.705 for the expert-curated gold standard (95% CI [0.629, 0.781]). We observed a mean AUROC of 0.633 (95% CI [0.610, 0.657], expert-curated gold standard) across all methods that re-parameterize time according to sequence and use either a joint autoregressive model with time-series differencing or an independent lag model without differencing. The complement of this set of methods achieved a mean AUROC close to 0.5, indicating the importance of these choices. We conclude that time-series analysis of EHR data will likely rely on some of the beneficial pre-processing and modeling methodologies identified, and will certainly benefit from continued careful analysis of methodological perturbations.
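As a rough illustration of three of the choices named above (sequence-based time, differencing, and an independent lag model), the following minimal sketch regresses a lab series on a drug series at a single lag. It is not the study's implementation: the function names, the z-score form of intra-subject normalization, and the toy data are all assumptions made for this example.

```python
from statistics import mean, pstdev


def normalize(values):
    """Intra-subject normalization, assumed here to be a per-patient z-score."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def difference(values):
    """Differencing: differences between consecutive measurements."""
    return [b - a for a, b in zip(values, values[1:])]


def lagged_slope(x, y, lag):
    """OLS slope of y[t] on x[t - lag].

    Time is indexed by sequence position (0, 1, 2, ...), not by clock time,
    which is one way to read "re-parameterize time according to sequence".
    """
    pairs = [(x[t - lag], y[t]) for t in range(lag, len(y))]
    if len(pairs) < 2:
        raise ValueError("need at least two lagged pairs")
    mx = mean(p[0] for p in pairs)
    my = mean(p[1] for p in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    if sxx == 0:
        return 0.0  # no variation in the explanatory variable
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy / sxx


# Toy data (hypothetical): the lab rises by 2 units one step after exposure.
drug = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
lab = [10, 10, 10, 12, 12, 10, 10, 12, 12, 10]

slope_raw = lagged_slope(drug, lab, lag=1)
slope_diff = lagged_slope(difference(drug), difference(lab), lag=1)
```

On this toy series both the raw and the differenced regressions recover the lag-1 effect; with real EHR data the two can disagree, which is the kind of sensitivity the abstract describes.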
This study found that methodological variations, such as pre-processing and representations, have a large effect on results, exposing the importance of thoroughly evaluating these components when comparing machine-learning methods.
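To make the evaluation metric concrete, here is a minimal sketch of how an AUROC could be computed when a method's per-pair scores are compared with a binary gold standard. It uses the pairwise (Mann-Whitney) formulation; the function name, scores, and labels are hypothetical and do not come from the study.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen positive pair outscores a negative one.

    Ties count as half a win, following the Mann-Whitney formulation.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four drug-lab pairs; label 1 means the gold
# standard expects a relationship for that pair.
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
example_auc = auroc(example_scores, example_labels)
```

A value near 0.5 means the scores rank expected and unexpected pairs no better than chance, which is how the "close to 0.5" result in the abstract can be read.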
