Preprocessing structured clinical data for predictive modeling and decision support

BACKGROUND EHR systems have high potential to improve healthcare delivery and management. Although structured EHR data are stored in machine-readable formats, using them for decision support still poses technical challenges for researchers, because the data must first be preprocessed and converted into a matrix format. During our research, we observed that the clinical informatics literature does not provide guidance on how to build this matrix while avoiding potential pitfalls.

OBJECTIVES This article aims to provide researchers with a roadmap of the main technical challenges of preprocessing structured EHR data and possible strategies to overcome them.

METHODS Following the standard data processing stages that make up the EDPAI framework (extracting database entries, defining features, processing data, assessing feature values, and integrating data elements), we identified the main challenges faced by researchers and reflected on how to address them, based on lessons learned from our research experience and on best practices from the related literature. We highlight the main potential sources of error, present strategies to approach those challenges, and discuss the implications of these strategies.

RESULTS Following the EDPAI framework, researchers face five key challenges: (1) gathering and integrating data, (2) identifying and handling different feature types, (3) combining features to handle redundancy and granularity, (4) addressing data missingness, and (5) handling multiple feature values. Strategies to address these challenges include cross-checking identifiers for robust data retrieval and integration; applying clinical knowledge when identifying feature types, addressing redundancy and granularity, and accommodating multiple feature values; and adequately investigating patterns of missingness.

CONCLUSIONS This article contributes to the literature by providing a roadmap to inform structured EHR data preprocessing. It may advise researchers on potential pitfalls and on the implications of methodological decisions in handling structured data, so as to avoid biases and help realize the benefits of the secondary use of EHR data.
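
To make the matrix-building step more concrete, the following is a minimal Python sketch, not the authors' method. It assumes a hypothetical long-format extract of `(patient_id, feature, value, day_of_stay)` tuples and a hypothetical helper `build_matrix`. Keeping the most recent value when a feature has multiple values, and encoding missingness as a 0/1 flag, are illustrative assumptions only; in practice both choices would depend on clinical knowledge and on the investigation of missingness patterns described above.

```python
from collections import defaultdict

# Hypothetical long-format extract: (patient_id, feature, value, day_of_stay).
records = [
    ("p1", "heart_rate", 88.0, 1),
    ("p1", "heart_rate", 92.0, 2),  # second value for the same feature
    ("p1", "creatinine", 1.1, 1),
    ("p2", "heart_rate", 75.0, 1),  # p2 has no creatinine entry
]


def build_matrix(records, features):
    """Return {patient_id: row}; each feature yields a value and a missing flag."""
    latest = defaultdict(dict)  # patient -> feature -> (day, value)
    for patient, feature, value, day in records:
        seen = latest[patient].get(feature)
        # Assumption: when several values exist, keep the most recent one.
        if seen is None or day > seen[0]:
            latest[patient][feature] = (day, value)

    matrix = {}
    for patient in sorted(latest):
        row = []
        for feature in features:
            entry = latest[patient].get(feature)
            row.append(entry[1] if entry is not None else None)  # feature value
            row.append(0 if entry is not None else 1)            # 1 = missing
        matrix[patient] = row
    return matrix


features = ["heart_rate", "creatinine"]
matrix = build_matrix(records, features)
for patient, row in matrix.items():
    print(patient, row)
```

Each row holds one value column and one missingness flag per feature, so a missing measurement stays visible to later steps instead of being silently dropped or imputed.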

[36]  D. Heitjan,et al.  Annotation: what can be done about missing data? Approaches to imputation. , 1997, American journal of public health.

[37]  Clement J. McDonald,et al.  Development of the Logical Observation Identifier Names and Codes (LOINC) vocabulary. , 1998, Journal of the American Medical Informatics Association : JAMIA.

[38]  A. Hoerbst,et al.  Electronic Health Records , 2010, Methods of Information in Medicine.

[39]  T. Stijnen,et al.  Review: a gentle introduction to imputation of missing values. , 2006, Journal of clinical epidemiology.

[40]  Hugues Bersini,et al.  A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[41]  Griffin M. Weber,et al.  Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2) , 2010, J. Am. Medical Informatics Assoc..

[42]  S. Schneeweiss Learning from big health care data. , 2014, The New England journal of medicine.

[43]  Jimeng Sun,et al.  PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records , 2014, J. Biomed. Informatics.

[44]  David L. Olson,et al.  Rule induction in data mining: effect of ordinal scales , 2002, Expert Syst. Appl..

[45]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[46]  Krzysztof J. Cios,et al.  Uniqueness of medical data mining , 2002, Artif. Intell. Medicine.

[47]  Christel Daniel-Le Bozec,et al.  Using electronic health records for clinical research: The case of the EHR4CR project , 2015, J. Biomed. Informatics.

[48]  J. Cimino Desiderata for Controlled Medical Vocabularies in the Twenty-First Century , 1998, Methods of Information in Medicine.

[49]  William J Donnelly,et al.  Viewpoint:: Patient-Centered Medical Care Requires a Patient-Centered Medical Record , 2005, Academic medicine : journal of the Association of American Medical Colleges.

[50]  Kup-Sze Choi,et al.  Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage , 2013, Comput. Methods Programs Biomed..

[51]  Jason Roy,et al.  Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches , 2010, Medical care.

[52]  Richard P. Lippmann,et al.  An introduction to computing with neural nets , 1987 .

[53]  Casey C. Bennett,et al.  Utilizing RxNorm to Support Practical Computing Applications: Capturing Medication History in Live Electronic Health Records , 2012, J. Biomed. Informatics.

[54]  Blaz Zupan,et al.  Predictive data mining in clinical medicine: Current issues and guidelines , 2008, Int. J. Medical Informatics.

[55]  Casey Holmes The problem list beyond meaningful use. Part I: The problems with problem lists. , 2011, Journal of AHIMA.

[56]  A Depeursinge,et al.  Clinical Data Mining: a Review , 2009, Yearbook of Medical Informatics.

[57]  Grigorios Tsoumakas,et al.  Mining Multi-label Data , 2010, Data Mining and Knowledge Discovery Handbook.

[58]  Dipak Kalra,et al.  Building a Logical EHR architecture based on ISO 13606 standard and Semantic Web Technologies , 2010, MedInfo.

[59]  Heikki Mannila,et al.  Principles of Data Mining , 2001, Undergraduate Topics in Computer Science.

[60]  Paula Asikainen,et al.  The outcomes of regional healthcare information systems in health care: A review of the research literature , 2009, Int. J. Medical Informatics.

[61]  João Miguel da Costa Sousa,et al.  Missing data in medical databases: Impute, delete or classify? , 2013, Artif. Intell. Medicine.

[62]  Henry W. W. Potts,et al.  Predicting length of stay from an electronic patient record system: a primary total knee replacement example , 2014, BMC Medical Informatics and Decision Making.

[63]  Sotiris B. Kotsiantis,et al.  Supervised Machine Learning: A Review of Classification Techniques , 2007, Informatica.

[64]  Dimitris Koutsouris,et al.  Medical support system for continuation of care based on XML web technology , 2001, Int. J. Medical Informatics.

[65]  Paul A. Harris,et al.  Secondary use of clinical data: The Vanderbilt approach , 2014, J. Biomed. Informatics.

[66]  Christel Daniel-Le Bozec,et al.  Integrating clinical research with the Healthcare Enterprise: From the RE-USE project to the EHR4CR platform , 2011, J. Biomed. Informatics.

[67]  P. Shekelle,et al.  Systematic Review: Impact of Health Information Technology on Quality, Efficiency, and Costs of Medical Care , 2006, Annals of Internal Medicine.

[68]  Claudio Bartolini,et al.  A Service-oriented Architecture for Business Intelligence , 2007, IEEE International Conference on Service-Oriented Computing and Applications (SOCA '07).

[69]  Joel H. Saltz,et al.  Model Formulation: caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research , 2008, J. Am. Medical Informatics Assoc..

[70]  D. Bates,et al.  Ten key considerations for the successful implementation and adoption of large-scale health information technology. , 2013, Journal of the American Medical Informatics Association : JAMIA.

[71]  J. Ross Quinlan,et al.  Decision trees and decision-making , 1990, IEEE Trans. Syst. Man Cybern..

[72]  Huan Liu,et al.  Feature Selection: An Ever Evolving Frontier in Data Mining , 2010, FSDM.

[73]  Vickie Nguyen,et al.  Detection and characterization of usability problems in structured data entry interfaces in dentistry , 2013, Int. J. Medical Informatics.

[74]  Amnon Shabo,et al.  Model Formulation: HL7 Clinical Document Architecture, Release 2 , 2006, J. Am. Medical Informatics Assoc..

[75]  Cui Tao,et al.  Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: The SHARPn project , 2012, J. Biomed. Informatics.