Generic Health Measurement: Past Accomplishments and a Measurement Paradigm for the 21st Century

Over the past 30 years, researchers have generated numerous tools that use self-reporting to measure functional status, emotional well-being, and subjective perceptions of health [1-22]. Use of generic surveys has increased dramatically in recent years as a result of the outcomes movement. Some authors have advocated the routine inclusion of data on generic health status in large databases [23], but others, such as Liang and Shadick [24], caution that the utility of doing so remains largely unproven. The purpose of this paper is to describe the history of generic health measurement and to suggest a more modern measurement paradigm for the 21st century. The conjoint use of computerized adaptive testing and item response theory offers distinct advantages for health outcomes assessment that could improve the feasibility and utility of including patient-centered data in large administrative databases.

The Evolution of Generic Health Measurement

Figure 1 presents a timeline of the evolution of generic health measures with respect to broader developments in health policy and health status assessment. Roughly coincident with the publication of the World Health Organization's definition of health [25] was the emergence of clinically based, global rating scales whose content extended beyond organ function to encompass human function. Such measures as the Karnofsky performance status scale [26] and the function scale of the American Rheumatism Association [27] were intended to supplement physiologic measures in an attempt to better understand treatment effectiveness. Around the same time, efforts to modernize national health indicators, including the incorporation of single-item indicators of activity limitations and perceived health in the National Health Interview Survey [28], were initiated [29].
Figure 1. Timeline of the evolution of generic health measures with respect to broader developments in health policy and health status assessment.
The policy initiatives of the War on Poverty in the mid-1960s prompted two advances in health measurement. First, the social indicators movement ushered in measurement of quality of life in general populations [30, 31] and provided indicators of how well we lived, which were to be used with existing measures of how much we produced and spent [32]. Second, unified indexes of mortality and morbidity were developed for planning and evaluation purposes at the population health level [33-35]. A watershed for generic health assessment can be traced to the Human Population Laboratory, which launched measurement work in physical, mental, and social health [1-4]. As important, the Human Population Laboratory demonstrated that respondents will complete long surveys by mail [36], a finding that reduced the bias against mail surveys. In the 1970s, the development of generic tools proliferated, in part as a result of extramural support from the National Center for Health Services Research. Definitional expansiveness was the signature of this era, and multi-item scales replaced single-item measures. The Quality of Well-Being Scale, developed for priority setting and program evaluation, represented a meaningful advance by measuring the value components of a social indicator of health [6, 37]. Next, the Sickness Impact Profile [7, 38] was developed for health care evaluation. The 136 items in this profile were obtained from patients, providers, and caregivers and yielded individual health profiles and summary scores. The McMaster Health Index Questionnaire [8, 39] followed. Intended for use in clinical and health services research, it measured physical, social, and mental health by using 59 items. The Health Perceptions Questionnaire [40] was constructed for use in health planning and evaluation and tapped the elusive realm of positive health. In 1979, health status measures for the adult general population emerged from the Health Insurance Experiment [9]. 
Next, the Nottingham Health Profile [10, 41] was developed for use in population surveys, clinical trials, and clinical practice. The 38 items in the Nottingham Health Profile tapped six health concepts and were derived from patients. The Duke-UNC Health Profile [11] was developed for use in research and clinical applications in primary care. The 63 items in this profile covered four health concepts and were obtained from the literature. In the early 1980s, development of new measures took a respite, but health research increasingly applied existing measures [42-44]. Interest in methodologic issues increased [45-48]. By the mid-1980s, interest had developed in the use of generic tools in everyday clinical practice, largely because of research showing poor correspondence between clinician and patient ratings of function and well-being [49-51]. In addition, growing recognition of the biopsychosocial model [52] and its relevance to an aging population resulted in increased appreciation that the preservation of function and well-being is an important goal of medical care [53]. Clinical practice applications ushered in the era of practicality. Shorter tools were developed: The Functional Status Questionnaire consisted of 34 items [13], and the Dartmouth COOP Charts had 9 items [14]. These tools were developed with measurement priorities directed toward practical efficiency (for example, ease of administration and scoring), which was achieved at the expense of measurement precision [54, 55]. The most recent era of health measurement is that of psychometric efficiency, which has several underpinnings. First, the outcomes movement gained momentum after Ellwood's Shattuck lecture was published [23] and the Agency for Health Care Policy and Research was established in 1989. Large-scale studies of patient-based outcomes were imminent. 
Second, burdened by study costs that spanned outcomes ranging from pathophysiology to quality of life, the clinical trials community sought more economical measures of health status. Third, concerns about respondent burden among severely ill patients encouraged shorter surveys. The Medical Outcomes Study (MOS) Short Form (SF) 20 Survey [16, 56] was the first to surface. The 20 items derived largely from the Health Insurance Experiment and tapped six health concepts. Next emerged the Duke Health Profile [17], a 17-item survey that was empirically derived from the original Duke-UNC Health Profile. The SF-36 [21] was developed from the SF-20 and the 149-item Functioning and Well-Being Profile, which measures 16 health concepts [19]. The SF-6 Survey, derived from the Functioning and Well-Being Profile, uses a single item to tap each of 6 health concepts [19]. The SF-12 Survey is an empirically derived short form of the SF-36 [22]. Over the past 30 years, we have greatly improved our measurement bandwidth in generic health assessment (the breadth of health dimensions measured). Many different health concepts are now measured across the armamentarium of generic tools, although specific surveys differ in bandwidth (for example, the Sickness Impact Profile measures 12 health concepts, whereas the McMaster Health Index Questionnaire measures just 3). However, many generic measures, even those with excellent bandwidth, still have problems of fidelity (that is, thoroughness and depth of measurement). Thus, although we now quantify many different dimensions of health, we often do so at the expense of precision. Overall, many generic tools lack the precision required for effective health care decision making. Precision is conceptualized here and elsewhere [57] as a property of a measure that encompasses both the range or depth of measurement and the number of distinct levels enumerated by a scale (fineness of specification). 

Prevailing Measurement Paradigm

Generic health status tools have been developed in the group-testing tradition. The defining signature of group tests is the use of a fixed set of questions (items) for all respondents, regardless of the appropriateness of any specific item for a given individual respondent. Items in group tests are selected or written to represent a moderate range of activities at a moderate level of difficulty. The era of psychometric efficiency emphasized construction of generic measures containing as few items as possible. Acceptable standards of face and content validity and reliability with few items can best be achieved by selecting items that are fairly homogeneous. Thus, selected items are often in the middle range of item difficulty and are almost alternate forms of each other. These measurement standards have two consequences, both of which are evident in generic measures. First, fixed-length health surveys tend to bore healthier respondents (because they have to wade through items that are easy for them to do, such as bathing) and frustrate more impaired respondents (because they have to respond to items that are clearly impossible for them to do, such as running one block). Such complaints about generic surveys are common from respondents. Respondents do not object to survey length itself; rather, they are frustrated by redundant items and items that to them are of low salience and relevance [58-60]. Second, because item selection is geared toward the middle of the road in content coverage and difficulty, the end points of the health continuum tend to be poorly defined. This yields ceiling effects for general populations and floor effects for more disabled populations. For many generic measures [54, 55, 61], score distributions are often highly skewed, such that a plurality of respondents are classified as being in a state of perfect health at or near the ceiling of the scale.
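Ceiling and floor effects of the kind described here are commonly summarized as the percentage of respondents who obtain the highest or lowest possible score. The following is a minimal sketch in Python, assuming scores on a hypothetical 0 to 100 scale; the function name `ceiling_floor_effects`, its parameters, and the sample scores are illustrative assumptions and are not drawn from any study cited in this paper.

```python
from typing import Sequence, Tuple


def ceiling_floor_effects(
    scores: Sequence[float],
    scale_min: float = 0.0,
    scale_max: float = 100.0,
) -> Tuple[float, float]:
    """Return (ceiling_pct, floor_pct) for a set of scale scores.

    ceiling_pct is the percentage of respondents at the scale maximum;
    floor_pct is the percentage at the scale minimum. The 0-100 default
    range is an assumption made for illustration only.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    at_ceiling = sum(1 for s in scores if s >= scale_max)
    at_floor = sum(1 for s in scores if s <= scale_min)
    return 100.0 * at_ceiling / n, 100.0 * at_floor / n


# Hypothetical scores from ten respondents (not real data).
sample_scores = [100, 100, 95, 100, 80, 100, 100, 60, 100, 0]
ceiling_pct, floor_pct = ceiling_floor_effects(sample_scores)
print(f"ceiling: {ceiling_pct:.0f}%, floor: {floor_pct:.0f}%")
```

In this made-up sample, six of ten respondents sit at the scale maximum, so the scale cannot distinguish among them; this is the loss of information that a skewed score distribution implies.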
Very large ceiling effects (up to 70%) have been observed in general and primary care populations [41, 62-64]. Ceiling effects are more prevalent than floor effects because many generic tools represent health as the absence of limitations. Score imprecision has two principal consequences. First, it is impossible to distinguish a

[1]  M H Liang,et al.  Feasibility and Utility of Adding Disease-Specific Outcome Measures to Administrative Databases To Improve Disease Management , 1997, Annals of Internal Medicine.

[2]  C. McHorney,et al.  Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. , 1997, Journal of clinical epidemiology.

[3]  G. Karabatsos,et al.  Validation of the Functional Assessment of Multiple Sclerosis quality of life instrument , 1996, Neurology.

[4]  A. Tennant,et al.  Are we making the most of the Stanford Health Assessment Questionnaire? , 1996, British journal of rheumatology.

[5]  C. Bombardier,et al.  Measuring health in injured workers: a cross-sectional comparison of five generic health status instruments in workers with musculoskeletal injuries. , 1996, American journal of industrial medicine.

[6]  B. A. Bergstrom Computerized adaptive testing for the national certification examination. , 1996 .

[7]  J. Ware,et al.  A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. , 1996, Medical care.

[8]  M. Abrahamowicz,et al.  The Quebec Back Pain Disability Scale: conceptualization and development. , 1996, Journal of clinical epidemiology.

[9]  T. Perneger,et al.  Validation of a French-language version of the MOS 36-Item Short Form Health Survey (SF-36) in young healthy adults. , 1995, Journal of clinical epidemiology.

[10]  R. Kessler,et al.  Measuring the effects of medical interventions. , 1995, Medical care.

[11]  J. Teresi,et al.  Item bias in cognitive screening measures: comparisons of elderly white, Afro-American, Hispanic and high and low education subgroups. , 1995, Journal of clinical epidemiology.

[12]  S. Kreiner,et al.  Rasch analysis in the development of a rating scale for assessment of mobility after stroke , 1995, Acta neurologica Scandinavica.

[13]  C. McHorney,et al.  Evaluation of the MOS SF-36 physical functioning scale (PF-10): I. Unidimensionality and reproducibility of the Rasch item scale. , 1994, Journal of clinical epidemiology.

[14]  P D Phelan,et al.  Measurement of functional severity of asthma in children. , 1994, American journal of respiratory and critical care medicine.

[15]  C. McHorney,et al.  Comparisons of the Costs and Quality of Norms for the SF-36 Health Survey Collected by Mail Versus Telephone Interview: Results From a National Survey , 1994, Medical care.

[16]  John J. Norcini,et al.  Computers in Physician Licensure and Certification: New Methods of Assessment , 1994 .

[17]  D. Dillman,et al.  Effects of Questionnaire Length, Respondent-Friendly Design, and a Difficult Question on Response Rates for Occupant-Addressed Census Mail Surveys , 1993 .

[18]  W. Broadhead,et al.  Patient acceptance of two health status measures: the Medical Outcomes Study Short-form General Health Survey and the Duke Health Profile. , 1993, Family medicine.

[19]  A G Fisher,et al.  The assessment of IADL motor skills: an application of many-faceted Rasch analysis. , 1993, The American journal of occupational therapy : official publication of the American Occupational Therapy Association.

[20]  Gene A. Kramer,et al.  Setting a standard on the pilot National Board Dental Examination. , 1992, Journal of dental education.

[21]  C. Sherbourne,et al.  The MOS 36-Item Short-Form Health Survey (SF-36) , 1992 .

[22]  W. Fisher,et al.  Applying psychometric criteria to functional assessment in medical rehabilitation: II. Defining interval measures. , 1992, Archives of physical medicine and rehabilitation.

[23]  Anastasia E. Raczek,et al.  The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Results from the Medical Outcomes Study. , 1992, Medical care.

[24]  S. Haley,et al.  A Hierarchical Model of Functional Performance in Rehabilitation Medicine , 1992 .

[25]  Judith L. Johnson,et al.  The Basic HIV Disease Knowledge Questionnaire: A Rasch-Scaled Instrument to Measure Essential HIV Knowledge , 1991, Psychological reports.

[26]  F. Gilbert,et al.  Development of a "Steps Questionnaire". , 1991, Journal of studies on alcohol.

[27]  P. De Boeck,et al.  Measuring the severity of depression through a self-report inventory. A comparison of logistic, factorial and implicit models. , 1991, Journal of affective disorders.

[28]  D. McArthur,et al.  Rasch analysis of functional assessment scales: an example using pain behaviors. , 1991, Archives of physical medicine and rehabilitation.

[29]  C. Tse,et al.  The Duke Health Profile: A 17-Item Measure of Health and Dysfunction , 1990, Medical care.

[30]  J. Jobe,et al.  Cognitive laboratory approach to designing questionnaires for surveys of the elderly. , 1990, Public health reports.

[31]  A. Jette,et al.  Improving patient function: a randomized trial of functional disability screening. , 1989, Annals of internal medicine.

[32]  A. Stewart,et al.  Functional status and well-being of patients with chronic conditions. Results from the Medical Outcomes Study. , 1989, JAMA.

[33]  J. Jobe,et al.  Cognitive research improves questionnaires. , 1989, American journal of public health.

[34]  G. Webster,et al.  An Application of Item Response Theory to Certifying Examinations in Internal Medicine , 1988 .

[35]  A. Stewart,et al.  The MOS short-form general health survey. Reliability and validity in a patient population. , 1988, Medical care.

[36]  D J Balaban,et al.  Weights for Scoring the Quality of Well-being Instrument Among Rheumatoid Arthritics: A Comparison to General Population Weights , 1986, Medical care.

[37]  C. Bombardier,et al.  Auranofin therapy and quality of life in patients with rheumatoid arthritis. Results of a multicenter trial. , 1986, The American journal of medicine.

[38]  M. Bergner,et al.  A cross-cultural comparison of health status values. , 1985, American journal of public health.

[39]  D. Weiss Adaptive testing by computer. , 1985, Journal of consulting and clinical psychology.

[40]  R. Kaplan,et al.  The Costs and Effects of Behavioral Programs in Chronic Obstructive Pulmonary Disease , 1984, Medical care.

[41]  L. Rubenstein,et al.  Systematic biases in functional status assessment of elderly adults: effects of different data sources. , 1984, Journal of gerontology.

[42]  A. Stoddard,et al.  Use of a Surrogate for the Sickness Impact Profile , 1984, Medical care.

[43]  R. Deyo,et al.  Pitfalls in measuring the health status of Mexican Americans: comparative validity of the English and Spanish Sickness Impact Profile. , 1984, American journal of public health.

[44]  E. Nelson,et al.  Functional health status levels of primary care patients. , 1983, JAMA.

[45]  M. Bergner,et al.  A controlled randomized study of early cardiac rehabilitation: the Sickness Impact Profile as an assessment tool. , 1983, Heart & lung : the journal of critical care.

[46]  George W. Torrance,et al.  Application of Multi-Attribute Utility Theory to Measure Social Preferences for Health States , 1982, Oper. Res.

[47]  David J. Weiss,et al.  Improving Measurement Quality and Efficiency with Adaptive Testing , 1982 .

[48]  S. Jachuck,et al.  The effect of hypotensive drugs on the quality of life. , 1982, The Journal of the Royal College of General Practitioners.

[49]  E. Wagner,et al.  The Duke-UNC Health Profile: An Adult Health Status Instrument for Primary Care , 1981, Medical care.

[50]  M. Bergner,et al.  The Sickness Impact Profile: Development and Final Revision of a Health Status Measure , 1981, Medical care.

[51]  J. Mcewen,et al.  The development of a subjective health indicator. , 1980, Sociology of health & illness.

[52]  K. N. Williams,et al.  Overview of adult health measures fielded in Rand's health insurance study. , 1979, Medical care.

[53]  M. Chen The gross national health product: a proposed population health index. , 1979, Public health reports.

[54]  T. Heberlein,et al.  Factors affecting response rates to mailed questionnaires: A quantitative analysis of the published literature. , 1978 .

[55]  T. Bice Comments on Health Indicators: Methodological Perspectives , 1976, International journal of health services : planning, administration, evaluation.

[56]  M. Bergner,et al.  The sickness impact profile. Development of an outcome measure of health care. , 1975, American journal of public health.

[57]  K. S. Renne Measurement of social health in a general population survey. , 1974 .

[58]  D L Patrick,et al.  Toward an operational definition of health. , 1973, Journal of health and social behavior.

[59]  L. Breslow,et al.  A quantitative approach to the World Health Organization definition of health: physical, mental and social well-being. , 1972, International journal of epidemiology.

[60]  P. L. Berkman Measurement of mental health in a general population survey. , 1971, American journal of epidemiology.

[61]  L. Breslow,et al.  Measurement of physical health in a general population survey. , 1971, American journal of epidemiology.

[62]  D. Sullivan,et al.  A single index of mortality and morbidity. , 1971, HSMHA health reports.

[63]  J. R. Hochstim A Critical Comparison of Three Strategies of Collecting Data from Households , 1967 .

[64]  B S Sanders,et al.  Measuring community health levels. , 1964, American journal of public health and the nation's health.

[65]  David W. Kalisch,et al.  The National Health Survey , 1957, Social Service Review.

[66]  O. Steinbrocker,et al.  Therapeutic criteria in rheumatoid arthritis. , 1949, Journal of the American Medical Association.

[67]  D. Karnofsky,et al.  The use of the nitrogen mustards in the palliative treatment of carcinoma. With particular reference to bronchogenic carcinoma , 1948 .

[68]  M. Lunz,et al.  Validity of item selection: A comparison of automated computerized adaptive and manual paper and pencil examinations , 1996 .

[69]  L. Chambers The McMaster Health Index Questionnaire: an update , 1993 .

[70]  Rachel M. Rosser,et al.  A health index and output measure , 1993 .

[71]  F. Fields Computerized adaptive testing for NCLEX-PN. , 1992, The Journal of practical nursing.

[72]  C. Sherbourne,et al.  Preliminary Tests of a 6-Item General Health Survey , 1992 .

[73]  C. Sherbourne,et al.  Summary and Discussion of MOS Measures , 1992 .

[74]  L E Kazis,et al.  Health status reports in the care of patients with rheumatoid arthritis. , 1990, Journal of clinical epidemiology.

[75]  A. Williams EuroQol : a new facility for the measurement of health-related quality of life , 1990 .

[76]  P. Ellwood,et al.  Shattuck lecture--outcomes management. A technology of patient experience. , 1988, The New England journal of medicine.

[77]  Nora Cate Schaeffer,et al.  An Application of Item Response Theory to the Measurement of Depression , 1988 .

[78]  A. Stewart,et al.  Assessment of function in routine clinical practice: description of the COOP Chart method and preliminary findings. , 1987, Journal of chronic diseases.

[79]  L. Cluff Chronic disease, function and the quality of care. , 1981, Journal of chronic diseases.

[80]  L. Chambers,et al.  Development and application of an index of social function. , 1976, Health services research.

[81]  D. Patrick,et al.  Social indicators for health based on function status and prognosis , 1973 .