Database Standardization, Linkage, and the Protection of Privacy

In writing this summary, I was tempted to call on one of the mantras of the current generation: reduce, reuse, recycle. The authors of the preceding three articles all seek to reduce the need for expensive and time-consuming clinical trials, which may be difficult or impossible to conduct, by reusing data already available in existing databases and by recycling them into products not envisioned when the data were originally collected. The views expressed in the paper by Gostin suggest that, in the context of health care data, this mantra should perhaps be expanded from three Rs to four: reduce, reuse, recycle responsibly. Because of the need to address pressing health care issues, data collected for one reason can and should be used in ways not originally intended; however, researchers must consider the legal and ethical implications of such use. Irresponsible reuse and recycling is a recipe for failure; it will inevitably diminish the means by which researchers can address important health-related questions.
McDonald and colleagues concentrate on barriers to reusing and recycling clinical data. Distinguishing between operational data (patient information gathered in direct support of patient care) and analytic data, they identify two major barriers to the use of operational data in health care research: differences in structure between operational and analytic databases, and variation in coding across different database systems. They propose that the first barrier can be overcome by selective and standardized definitions of analytic variables based on operational data. Overcoming the second barrier requires substantially more investment; the authors recommend that nonstandard coding systems be mapped to standard ones, such as LOINC (Logical Observation Identifiers Names and Codes) and SNOMED (Systematized Nomenclature of Medicine).
Standardization in both these areas would increase our ability to reuse and recycle readily available clinical data effectively. The power of recycling to address important health-related questions can also be enhanced by database linkage, whereby databases with different characteristics are connected to leverage the strengths of each. Lillard and Farmer discuss issues related to the linking of Medicare data with data obtained in health or demographic surveys in the context of research on older persons. For example, data on uncovered medical services and important dimensions of cost are missing from Medicare claims, but this information can be acquired by surveys. Conversely, it is difficult for the survey mechanism to obtain accurate information on health care utilization rates and the cost of reimbursable health services, information that is better obtained from Medicare records. The two data sources together can provide a more comprehensive picture of health and health care costs than can either individually. In general, linked data are more powerful tools for research.
Increased power must have its drawbacks, and Gostin sounds the cautionary note. He points out that systematic and standardized collection of health-related data comes at a substantial cost in privacy. An extensive health information infrastructure leads to increased opportunities for inappropriate use by authorized users, as well as the potential for access and exploitation by unauthorized parties. Automation is no panacea: it can be used to improve the security of computerized data, but the increasing ease with which electronic data can be disseminated and linked also increases the potential for abuse. Gostin argues that current law is inadequate to protect against misuse of increasingly comprehensive electronic medical records.
Considered together, these three papers suggest that the path to the future of health care research using computerized data is paved, but some sizable potholes remain. Unless severely restricted by new laws or privacy safeguards, standardization and linkage will result in ever more powerful tools for health care research. In turn, these tools must be treated ever more responsibly by their end users. My experience as a biostatistician suggests that researchers who work with health care data hold varying attitudes toward these data, spanning the spectrum from cautious to cavalier. Even among biostatisticians, little attention has historically been paid to issues of data integrity and security [1]. The competing and complementary principles of reduce, reuse, recycle responsibly are still being negotiated among the many stakeholders in health care: patients, physicians, health care organizations, insurers, pharmaceutical companies, government agencies, and legislatures. How this tension will be resolved is still a matter for speculation.
Emmanuel N. Lazaridis, PhD
The Regenstrief Institute for Health Care; Indiana University Medical Center; Indianapolis, IN 46202