Denormalize and Delimit: How not to Make Data Extraction for Analysis More Complex than Necessary

There are many legitimate reasons why standards for formatting biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, these standards at best under-serve the common scenario of a biostatistician who simply needs to import a given dataset into their statistical software. Statisticians are forced to act as amateur database administrators, pivoting and joining their data into a usable form before they can even begin the work they specialize in. Or worse, they find their choice of statistical tools dictated not by their own experience and skills, but by remote standards bodies or inertial administrative choices. This may limit academic freedom. If the formats in question require the use of one proprietary software package, it also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources.

The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between the structural, semantic, and syntactic levels of data interoperability. The semantic level is legitimately a complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made about the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form, or 1NF (Codd, 1979). Once data is merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers.
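The general idea of denormalizing and then delimiting can be illustrated with a minimal sketch. This is not DataFinisher's implementation: the sample observations, the function names `denormalize` and `to_delimited`, and the `patient_id` column are all hypothetical, and the sketch assumes the simplest case, in which each patient has at most one value per attribute (a later duplicate would overwrite an earlier one). It pivots long-format (entity, attribute, value) rows into a single table with one row per patient, then writes that table as tab-delimited text using only the Python standard library.

```python
import csv
import io

# Hypothetical long-format observations: (entity, attribute, value).
observations = [
    ("p1", "age", "54"),
    ("p1", "sex", "F"),
    ("p2", "age", "61"),
    ("p2", "sbp", "130"),
]


def denormalize(rows):
    """Pivot (entity, attribute, value) rows into one dict per entity.

    Returns the pivoted table and the column names in first-seen order.
    Assumption: one value per entity/attribute pair; later values overwrite.
    """
    table = {}
    columns = []
    for entity, attribute, value in rows:
        table.setdefault(entity, {})[attribute] = value
        if attribute not in columns:
            columns.append(attribute)
    return table, columns


def to_delimited(table, columns, delimiter="\t"):
    """Serialize the pivoted table as delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["patient_id"] + columns)
    for entity in sorted(table):
        # Attributes never observed for an entity become empty cells.
        writer.writerow([entity] + [table[entity].get(c, "") for c in columns])
    return buf.getvalue()


table, columns = denormalize(observations)
text = to_delimited(table, columns)
print(text)
```

The resulting text can be read by the delimited-file import routine of most statistical packages without further joins or pivots. Handling repeated measurements per patient would need an explicit rule, which this sketch deliberately leaves out.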

[1]  Dan Connolly,et al.  SEINE: Methods for Electronic Data Capture and Integrated Data Repository Synthesis with Patient Registry Use Cases , 2015 .

[2]  T Ganslandt,et al.  Integrated Data Repository Toolkit (IDRT). A Suite of Programs to Facilitate Health Analytics on Heterogeneous Medical Data. , 2016, Methods of information in medicine.

[3]  Stephen B. Johnson Model Formulation: Generic Data Modeling for Clinical Repositories , 1996, J. Am. Medical Informatics Assoc..

[4]  J Glaser,et al.  Separation of Concerns , 2014 .

[5]  Richard Platt,et al.  Launching PCORnet, a national patient-centered clinical research network , 2014, Journal of the American Medical Informatics Association : JAMIA.

[6]  R. Kush,et al.  Global clinical data interchange standards are here! , 2007, Drug discovery today.

[7]  Wes McKinney,et al.  Python for Data Analysis , 2012 .

[8]  I. Kohane,et al.  Instrumenting the health care enterprise for discovery research in the genomic era. , 2009, Genome research.

[9]  Wes McKinney,et al.  Data Structures for Statistical Computing in Python , 2010, SciPy.

[10]  Cynthia Brandt,et al.  Pivoting approaches for bulk extraction of Entity-Attribute-Value data , 2006, Comput. Methods Programs Biomed..

[11]  Jr. Frederick P. Brooks,et al.  The Mythical Man-Month: Essays on Softw , 1978 .

[12]  Alan Edelman,et al.  Julia: A Fast Dynamic Language for Technical Computing , 2012, ArXiv.

[13]  P. Harris,et al.  Research electronic data capture (REDCap) - A metadata-driven methodology and workflow process for providing translational research informatics support , 2009, J. Biomed. Informatics.

[14]  Fred P. Brooks,et al.  The Mythical Man-Month , 1975, Reliable Software.

[15]  J. Overhage,et al.  Advancing the Science for Active Surveillance: Rationale and Design for the Observational Medical Outcomes Partnership , 2010, Annals of Internal Medicine.

[16]  E. F. Codd,et al.  Extending the database relational model to capture more meaning , 1979, ACM Trans. Database Syst..

[17]  Douglas Crockford,et al.  The application/json Media Type for JavaScript Object Notation (JSON) , 2006, RFC.

[18]  Marsha A Raebel,et al.  Design considerations, architecture, and use of the Mini‐Sentinel distributed data system , 2012, Pharmacoepidemiology and drug safety.

[19]  David Levine,et al.  The Analytic Information Warehouse (AIW): A platform for analytics using electronic health record data , 2013, J. Biomed. Informatics.

[20]  Douglas MacFadden,et al.  The Shared Health Research Information Network (SHRINE): A Prototype Federated Query Tool for Clinical Data Repositories , 2014 .

[21]  Jeannette M. Wing,et al.  A behavioral notion of subtyping , 1994, TOPL.

[22]  Frederick P. Brooks,et al.  The Mythical Man-Month: Essays on Softw , 1978 .

[23]  W. DeLano The case for open-source software in drug discovery. , 2005, Drug discovery today.

[24]  Andrew Copas,et al.  Review of methods for handling confounding by cluster and informative cluster size in clustered data , 2014, Statistics in medicine.

[25]  Prakash M. Nadkarni,et al.  The Greater Plains Collaborative: a PCORnet Clinical Research Data Network , 2014, J. Am. Medical Informatics Assoc..