Design of a National Distributed Health Data Network

Key Summary Points: Attributes of a National Distributed Health Data Network
Supports both observational and intervention studies.
Local data holder control over access and uses of data.
Mitigates need to share or exchange protected health information.
Singular, multipurpose, multi-institutional infrastructure.

A distributed health data network is a system that allows secure remote analysis of separate data sets, each derived from a different medical organization's or health plan's records. Such networks allow data holders to retain physical control over use of their data, thereby avoiding many obstacles related to confidentiality, regulation, and proprietary interests. They can be used for observational studies, particularly public health surveillance, and can also provide baseline and follow-up data to support clinical trials, including those that use cluster randomization. In addition, a network can monitor use, adoption, and diffusion of new technologies and clinical evidence. Such networks are critical elements of the learning health care system recommended by the Institute of Medicine (1), which supports the use of routinely collected health care data to improve our understanding of the comparative benefits and harms of medical technologies.

The United States will soon be able to analyze data from millions of individuals. Congress has mandated that the U.S. Food and Drug Administration develop a postmarket risk identification and analysis system that covers 100 million persons (2). In addition, the expansion of comparative effectiveness research envisioned by Congress requires access to health care information for large, diverse populations in real-world settings (3).

Large, centralized data repositories could support these functions, but we and others (4, 5) believe that a distributed health data network has many practical advantages. First, a distributed network allows data holders to retain physical and logical control of their data.
Second, it mitigates many security, proprietary, legal, and privacy concerns, including those regulated by the Privacy and Security Rules of the Health Insurance Portability and Accountability Act (6). Third, it eliminates the need to create, maintain, and secure access to central data repositories. Fourth, it minimizes the need to disclose protected health information outside the data-owning entity. Finally, a distributed network allows data holders to assess, track, and authorize requests for all data uses.

Several public agencies have supported the development of single-purpose distributed data networks, either directly or in principle (7-11). These networks are limited in scope and do not support the broad range of public and private needs filled by the network we describe. We favor a single distributed network with multiple uses (for example, one that could be used to study comparative clinical effectiveness and the diffusion of medical technologies) over multiple independent and single-purpose networks. A multipurpose network would reduce the burden on data holders of participating in multiple networks, as well as that on network developers of creating and maintaining redundant infrastructure. The framework that we describe suggests how we could develop a national network with broad capabilities.

How Would a National Distributed Health Data Network Work?

In the simplest national distributed health data network, each data holder creates a copy of their data (a network datamart) that adheres to a common data model, thus ensuring identical file structures, data fields, and coding systems. Several common data models already exist (10, 12-17).

The Figure illustrates the basic flow of network operations. Authorized users submit queries by means of a secure Web site. Data holders set authorization policies for each user and query type and can require approvals from privacy boards and institutional review boards.
The network interface allows nontechnical users to ask simple questions without assistance (for example, a report on the uptake of a given treatment by age, sex, and geographic region). It also allows sophisticated users to perform complex analyses (for example, comparing the rates of serious cardiovascular outcomes among patients who receive different second-line antihypertensive treatments).

For many questions, transferring protected health information will not be necessary. However, it may be necessary to aggregate relatively small amounts of data for analysis. Using the network, data holders may provide limited access to full-text medical records for validation and additional details. It is usually necessary to review only a small proportion of records to confirm diagnoses or to obtain risk factor data that are not coded (such as smoking status).

Figure. System operations in a distributed health network. An authorized user accesses the secure network Web site to submit queries (computer programs) to run against data in the network datamarts. The boxes at the far right depict areas under control of the data holder (data holders A through D are shown). Authorization to execute a query is under control of the data holder and can be limited to specific users and uses. Data holders retrieve queries for execution, which eliminates the need for data holders to monitor incoming requests. Query results are encrypted and returned to the central Web site, where they are processed and presented to the requester. Details of each step are recorded for auditing.

Example of the Use of a Distributed Network

Some research programs already use a distributed network model (10, 14, 18), which provides a relevant starting point to implement a national network. The HMO Research Network Center for Education and Research on Therapeutics has conducted many multisite studies by distributing computer programs that each site applied to a local copy of their data.
The outputs are then combined to provide aggregate results. Examples of studies performed in this way include the evaluation of laboratory monitoring practices for medications (18-25), the use of medications during pregnancy (26-28), and the use of medications that carry a black box warning (29). Such studies provide an important evidence development function that feeds back to providers, payers, and patients.

Policy Issues

Development and implementation of a multipurpose, multi-institutional distributed health data network requires substantial stakeholder engagement and dedicated software development. On the basis of the previously described research studies, we recommend incremental implementation with a limited set of data holders and data types. Begin with information about eligibility for health care (such as health plan enrollment data); this would allow identification of defined populations, which are important for many uses. Initial data should also include demographic characteristics; diagnosis, procedure, and pharmacy dispensing data (30); and, potentially, electronic health record data, such as vital signs. During initial implementation, pilot testing is needed to assess network design, software development, and development and implementation of the common data model.

A distributed network's viability depends on both its governance mechanisms and sustained funding. A governance institution is needed to develop and oversee procedures for requesting use of the network; to set priorities; and to audit use for compliance with various security, privacy, human subject research, and proprietary concerns. Such an institution should also monitor research integrity, data integrity, conflict of interest policies, transparency of activity and results, policies related to access and use, reproducibility, publishing rights, and dispute resolution.

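The operating pattern described above (holder-controlled authorization, local execution against a common data model, return of aggregate results only, and an audit trail) can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the network's actual software: the record layout, the class and function names (Dispensing, DataHolder, combine), the "uptake" query type, and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dispensing:
    """Hypothetical common-data-model row: one pharmacy dispensing."""
    person_id: str   # never leaves the data holder
    age_group: str
    sex: str
    drug: str


@dataclass
class DataHolder:
    """Hypothetical data holder with a local datamart and its own policy."""
    name: str
    datamart: list   # list of Dispensing rows
    allowed: set     # (user, query_type) pairs this holder has authorized
    audit_log: list = field(default_factory=list)

    def run(self, user, query_type, drug):
        """Run an uptake query locally; return aggregate counts or None."""
        authorized = (user, query_type) in self.allowed
        # Every request is recorded, whether or not it is executed.
        self.audit_log.append((user, query_type, drug, authorized))
        if not authorized:
            return None
        # Count distinct persons per (age_group, sex) stratum.
        persons = {(r.person_id, r.age_group, r.sex)
                   for r in self.datamart if r.drug == drug}
        return dict(Counter((age, sex) for _, age, sex in persons))


def combine(site_results):
    """Sum stratum counts across the sites that returned results."""
    total = Counter()
    for result in site_results:
        if result is not None:
            total.update(result)   # adds counts key by key
    return dict(total)


# Fabricated sample data, for illustration only.
holder_a = DataHolder("A", [
    Dispensing("a1", "65+", "F", "drug_x"),
    Dispensing("a1", "65+", "F", "drug_x"),   # refill, same person
    Dispensing("a2", "18-64", "M", "drug_x"),
    Dispensing("a3", "65+", "F", "drug_y"),
], {("analyst", "uptake")})
holder_b = DataHolder("B", [Dispensing("b1", "65+", "F", "drug_x")],
                      {("analyst", "uptake")})
holder_c = DataHolder("C", [Dispensing("c1", "65+", "M", "drug_x")],
                      set())                  # has not authorized this user

results = [h.run("analyst", "uptake", "drug_x")
           for h in (holder_a, holder_b, holder_c)]
combined = combine(results)
print(combined)
```

In this sketch only stratum counts cross the holder boundary; person-level rows stay in each datamart, and holder C's refusal is still captured in its own audit log. A real deployment would add the encryption, pull-based query retrieval, and review-board approvals that the text describes.
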
Annual development and maintenance costs would probably be several tens of millions of dollars for an initial system that covers up to 100 million persons. This would be similar to the 3-year startup cost for the National Cancer Institute's Cancer Biomedical Informatics Grid, which totaled $60 million for fiscal years 2004 to 2006 (31). The National Cancer Institute fiscal year 2010 budget requests $100 million for these efforts in addition to the current funding level (32). The total annual cost of developing and maintaining a network is in line with that of individual clinical trials routinely performed to evaluate new pharmaceuticals. Although initial implementation costs are sizeable, the expected marginal costs to use the system would be small for any particular study.

Various funding mechanisms are possible. Initially, we expect costs to be borne by the federal entities, whose current needs would drive network implementation. Ultimately, we believe the costs should be amortized over the system's multiple users and should support the network's expansion, functionality, and use. For example, methods could be developed for linking to the National Death Index or identifying individuals for whom multiple data holders possess different kinds of information (such as pharmacy data held by one source and clinical encounter data held by another). Advances in technologies designed to link individual records over time (such as anonymous identity resolution) without exposing protected health information are especially desirable (33).

Conclusion

A national distributed health data network can become an important asset to improving health and health care. A common core network would offer considerable advantages that would better support the needs of multiple users, such as the U.S.
Food and Drug Administration (for their Sentinel System) and the Agency for Healthcare Research and Quality (for their comparative effectiveness network), than would building individual networks for each of these uses. The similarities in data needs and uses, coupled with potential savings of time and effort, favor a single, multipurpose network. In addition, local data holder control over use and access would encourage particip

[1]  A Nelson,et al.  National Bioterrorism Syndromic Surveillance Demonstration Program. , 2004, MMWR supplements.

[2]  Steven R. Simon,et al.  Laboratory monitoring of drugs at initiation of therapy in ambulatory care , 2005, Journal of General Internal Medicine.

[3]  Isaac S Kohane,et al.  Model Formulation: A Self-scaling, Distributed Information Architecture for Public Health, Research, and Clinical Care , 2007, J. Am. Medical Informatics Assoc..

[4]  Robert L Davis,et al.  Baseline Laboratory Monitoring of Cardiovascular Medications in Elderly Health Maintenance Organization Enrollees , 2005, Journal of the American Geriatrics Society.

[5]  Richard Platt,et al.  Potential population‐based electronic data sources for rapid pandemic influenza vaccine adverse event detection: a survey of health plans , 2008, Pharmacoepidemiology and drug safety.

[6]  K. Chan,et al.  Development of a Multipurpose Dataset to Evaluate Potential Medication Errors in Ambulatory Settings , 2005 .

[7]  Robert L Davis,et al.  Monitoring of drugs with a narrow therapeutic range in ambulatory care. , 2006, The American journal of managed care.

[8]  R Platt,et al.  Multicenter epidemiologic and health services research on therapeutics in the HMO Research Network Center for Education and Research on therapeutics , 2001, Pharmacoepidemiology and drug safety.

[9]  Robert L Davis,et al.  FDA drug prescribing warnings: is the black box half empty or half full? , 2006, Pharmacoepidemiology and drug safety.

[10]  Clay Shirky,et al.  Collecting and sharing data for population health: a new paradigm. , 2009, Health affairs.

[11]  David W. Bates,et al.  Application of Information Technology: Health Care IT Collaboration in Massachusetts: The Experience of Creating Regional Connectivity , 2005, J. Am. Medical Informatics Assoc..

[12]  R. Platt,et al.  Frequency of Serum Creatinine Monitoring During Allopurinol Therapy in Ambulatory Patients , 2006, The Annals of pharmacotherapy.

[13]  J. Avorn,et al.  A review of uses of health care utilization databases for epidemiologic research on therapeutics. , 2005, Journal of clinical epidemiology.

[14]  R. Platt,et al.  Laboratory Evaluation of Potassium and Creatinine Among Ambulatory Patients Prescribed Spironolactone: Are We Monitoring for Hyperkalemia? , 2007, The Annals of pharmacotherapy.

[15]  P. Rogers Financial conflicts of interest , 2005 .

[16]  Richard Platt,et al.  Use of prescription medications with a potential for fetal harm among pregnant women , 2006, Pharmacoepidemiology and drug safety.

[17]  Richard Platt,et al.  Laboratory monitoring of potassium and creatinine in ambulatory patients receiving angiotensin converting enzyme inhibitors and angiotensin receptor blockers , 2007, Pharmacoepidemiology and drug safety.

[18]  Nikki M. Carroll,et al.  Liver and Thyroid Monitoring in Ambulatory Patients Prescribed Amiodarone in 10 HMOs , 2006, Journal of managed care pharmacy : JMCP.

[19]  John W. Glasser,et al.  Vaccine Safety Datalink project: a new tool for improving vaccine safety monitoring in the United States. The Vaccine Safety Datalink Team. , 1997, Pediatrics.

[20]  Richard Platt,et al.  Distributed data processing for public health surveillance , 2006, BMC public health.

[21]  R. Platt,et al.  Outpatient use of cardiovascular drugs during pregnancy , 2008, Pharmacoepidemiology and drug safety.

[22]  Sarah M. Greene,et al.  Building a virtual cancer research organization. , 2005, Journal of the National Cancer Institute. Monographs.

[23]  Jeff Jonas Identity resolution: 23 years of practical experience and observations at scale , 2006, SIGMOD Conference.

[24]  R. Platt,et al.  Use of antidepressant medications during pregnancy: a multisite study. , 2008, American journal of obstetrics and gynecology.