Fast and simple comparison of semi-structured data, with emphasis on electronic health records

We present a locality-sensitive hashing strategy for summarizing semi-structured data (e.g., in JSON or XML format) into ‘data fingerprints’: highly compressed representations from which details of the original data cannot be reconstructed, yet which preserve similarity relationships and thereby simplify and greatly accelerate the comparison and clustering of semi-structured data. Computation on data fingerprints is fast: in one example involving complex simulated medical records, the average time to encode one record was 0.53 seconds, and the average pairwise comparison time was 3.75 microseconds. Both processes are trivially parallelizable. Applications include duplicate detection, clustering, and classification of semi-structured data, which in turn support larger goals such as summarizing large and complex data sets, quality assessment, and data mining. We illustrate use cases with three analyses of electronic health records (EHRs): (1) pairwise comparison of patient records, (2) analysis of cohort structure, and (3) evaluation of methods for generating simulated patient data.
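To make the notion of a similarity-preserving fingerprint concrete, the following is a minimal sketch and not the algorithm presented in this work. It substitutes a generic technique, signed feature hashing of flattened path/value tokens, and all names (`flatten`, `fingerprint`, `similarity`), the vector length of 64, and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
import json
import math


def flatten(obj, path=""):
    """Yield one 'path=value' token for every leaf of a nested JSON-like object."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], f"{path}/{key}")
    elif isinstance(obj, list):
        for item in obj:
            # List positions are dropped, so reordering a list leaves the tokens unchanged.
            yield from flatten(item, f"{path}[]")
    else:
        yield f"{path}={json.dumps(obj)}"


def fingerprint(record, length=64):
    """Hash each token into a signed bucket; return a unit-length vector (assumed scheme)."""
    vec = [0.0] * length
    for token in flatten(record):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % length  # which position to update
        sign = 1.0 if digest[4] & 1 else -1.0                # +1 or -1 contribution
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # An empty record (or full cancellation) yields the zero vector unchanged.
    return [x / norm for x in vec] if norm else vec


def similarity(fp_a, fp_b):
    """Cosine similarity of two fingerprints that are already unit length."""
    return sum(a * b for a, b in zip(fp_a, fp_b))


if __name__ == "__main__":
    # Hypothetical toy records, not drawn from any real data set.
    rec_a = {"patient": {"sex": "F", "conditions": ["asthma", "diabetes"]}}
    rec_b = {"patient": {"sex": "F", "conditions": ["asthma"]}}
    print(similarity(fingerprint(rec_a), fingerprint(rec_b)))
```

In this sketch the fingerprint is a fixed-length vector of bucket sums, so individual field values cannot be read back from it, while records sharing many tokens would typically score closer to 1 than unrelated records. Each record is encoded independently and each comparison is a single dot product, which is why both steps parallelize trivially.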