The Evolution of Patient Diagnosis: From Art to Digital Data-Driven Science

Physicians are still taught to diagnose patients according to the 19th-century Oslerian blueprint. A physician takes a history, performs an examination, and matches each patient to the traditional taxonomy of medical conditions. Symptoms, signs, family history, and laboratory reports are interpreted in light of clinical experience and scholarly interpretation of the medical literature. However, diagnosis is evolving from art to data-driven science, whereby large populations contextualize each individual’s medical condition. Advances in artificial intelligence now bring insight from population-level data to individual care; a recent study sponsored by and including researchers from Google used data sets with more than 11 000 retinal fundus images to develop a deep learning algorithm that outperformed clinicians for detecting diabetic retinopathy.1

However, when clinicians make genetic diagnoses, they are practicing more like Osler than Google. The pathogenicity of a genetic variant is often determined from cohort studies of relatively small numbers of individuals. Even a simple comparison of a patient’s variant to a larger population of matched ancestry is generally not possible. Manrai et al2 illustrated the pitfalls of this approach, showing that monogenic variants considered diagnostic of hypertrophic cardiomyopathy in fact have a high frequency in unaffected individuals of African ancestry and therefore often appear to represent normal variants among black patients. This problem is more general: many studies implicate pathogenic variants yet lack sufficient numbers of ancestrally diverse cases and controls. Furthermore, Van Driest et al3 reviewed electronic health record (EHR) data and electrocardiograms in a cohort of 2022 genotyped patients and found that the majority of participants (41 of 63) with a designated variant in either SCN5A or KCNH2 (putatively associated with cardiac rhythm disturbances) had no identifiable pathological phenotype.

Artificial intelligence will eventually help clinicians extract the maximum knowledge from large genetic reference data sets. But the first step is simply to be able to calculate the statistical genetics that reveal how often a variant is associated with pathology and how outcomes compare in patients with and without the variant. To make a genetic diagnosis, a physician must evaluate a patient’s data against a larger, representative population. A high-functioning health care system not only needs the EHR databases produced as a byproduct of care but must also link EHRs to samples, sequence data, and the myriad data sources needed to characterize medical care, lifestyle, and environment.

Initiatives to develop genetic reference data at the population level can be grouped into 3 categories. First are well-known databases of genotype-phenotype relationships as observed and submitted by researchers (eg, Online Mendelian Inheritance in Man, ClinVar, and the National Human Genome Research Institute’s Genome-Wide Association Study [GWAS] Catalog). Second are databases that aggregate sequences collected from other studies for secondary use, such as the Genome Aggregation Database (gnomAD),4 the next iteration of the Exome Aggregation Consortium (ExAC) database,5 and the 1000 Genomes Project.6 Third, patients and other study participants are invited to donate data to registries such as GenomeConnect or to enroll in cohorts such as the National Institutes of Health All of Us initiative, which is recruiting 1 million patients to contribute biological samples and EHR data for research.
All 3 types of databases rely on a research framework rather than a clinical framework for accrual of patients and data. This distinction is important. The populations are selected by researchers or are self-selected, and the data are either deidentified or acquired after a research consent. The bias inherent in these populations may distort the accuracy of data-driven genomic diagnosis. Furthermore, even though deidentification addresses privacy concerns under the Health Insurance Portability and Accountability Act of 1996 (HIPAA), it often precludes future linkage of the myriad data sets needed to create a robust information commons. The architects of All of Us have instrumented enrollment centers to provide ongoing longitudinal phenotype data from EHRs. Still, relying on consented research participants is not population based, and success is limited by individuals’ willingness to participate and by the expense and logistics of obtaining consent.
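To make the notion of a population-scale comparison concrete, the sketch below shows one simple form such a calculation could take: tabulating how often carriers and noncarriers of a candidate variant exhibit a phenotype in an EHR-linked data set and testing the association. This is a minimal illustration only; the counts and the choice of a Fisher exact test are assumptions for the example, not methods or data drawn from the cited studies.

```python
# Illustrative only: compare phenotype prevalence in carriers vs noncarriers
# of a candidate variant, using counts of the kind an EHR-linked genomic
# database could supply. All numbers here are hypothetical.
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table:
#                        phenotype present   phenotype absent
#   variant carriers            12                  51
#   noncarriers                150                1809
carriers_affected, carriers_unaffected = 12, 51
noncarriers_affected, noncarriers_unaffected = 150, 1809

table = [[carriers_affected, carriers_unaffected],
         [noncarriers_affected, noncarriers_unaffected]]

# Fisher exact test of association between carrier status and phenotype.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

carrier_prevalence = carriers_affected / (carriers_affected + carriers_unaffected)
noncarrier_prevalence = noncarriers_affected / (noncarriers_affected + noncarriers_unaffected)

print(f"Phenotype prevalence, carriers:    {carrier_prevalence:.1%}")
print(f"Phenotype prevalence, noncarriers: {noncarrier_prevalence:.1%}")
print(f"Odds ratio = {odds_ratio:.2f}, Fisher exact P = {p_value:.3g}")
```

In practice, any such comparison would also require ancestry matching and adequately powered, diverse reference populations, which is precisely what current genetic databases often lack.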