Claims of Equivalence in Medical Research: Are They Supported by the Evidence?

Most clinical research activities are aimed at showing that one agent or method is better than another. An increasing number of reports, however, now conclude that the investigated entities are equivalent. Thus, a new drug or treatment may be deemed just as effective as a standard therapy while being, for example, less costly or easier to use (1). Equivalence can also be claimed for generic versions of innovator drugs (2) and for such diverse entities as medical protocols (3), surgical techniques (4), and medical devices (5).

As physicians, insurers, and hospitals put increasing emphasis on practicing evidence-based medicine, claims of substantial treatment benefit have come under scrutiny. In contrast, claims of therapeutic equivalence may not be reviewed with the same quantitative rigor. This can lead to patient harm if clinically inferior treatments are erroneously deemed equivalent to a standard approach or if potentially superior therapies are discarded as merely equivalent.

Despite the many scientific and statistical procedures used to confirm that a large difference is significant, less attention has been given to the logic and methods for establishing equivalence (6). In the context of hypothesis testing, equivalence exists only as a theoretical entity: an infinitely large sample size would be needed to unequivocally establish no difference between compared groups. In practice, an observed difference can be compared to a specified value considered "small" (that is, not clinically important). Unfortunately, there are no established gold standard criteria for how to construct and support such an equivalence claim. Proposed approaches include using confidence intervals to exclude clinically meaningful differences (7) or applying variations of the analytic strategy of rejecting the null hypothesis (8). In this context, testing for equivalence involves rejecting an alternative hypothesis of a large difference between examined groups or entities.
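The confidence-interval approach mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited methods papers: the function name `check_equivalence`, the Wald (unpooled) standard error, the 0.10 margin, and the patient counts are all assumptions chosen for illustration. Equivalence is claimed only when the whole (1 - 2 x alpha) interval for the difference in proportions lies inside the prespecified margin, and the conventional two-sided P value is reported alongside for contrast.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class EquivalenceCheck:
    difference: float     # observed difference in proportions, p1 - p2
    ci_low: float         # lower limit of the (1 - 2*alpha) interval
    ci_high: float        # upper limit of the (1 - 2*alpha) interval
    superiority_p: float  # two-sided P value for the usual "no difference" test
    equivalent: bool      # True only if the whole interval lies inside +/- margin


def check_equivalence(success1, n1, success2, n2, margin, alpha=0.05):
    """Compare two proportions against a prespecified equivalence margin.

    Assumes a normal approximation with an unpooled standard error, so it is
    only a rough guide for small samples or proportions near 0 or 1.
    """
    p1, p2 = success1 / n1, success2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha)
    low, high = diff - z * se, diff + z * se
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return EquivalenceCheck(diff, low, high, p_value, -margin < low and high < margin)


# Hypothetical small trial: 12/20 vs 8/20 successes, margin of 10 percentage points.
small = check_equivalence(12, 20, 8, 20, margin=0.10)
# Hypothetical large trial: 600/1000 vs 590/1000 successes, same margin.
large = check_equivalence(600, 1000, 590, 1000, margin=0.10)

for label, result in (("small", small), ("large", large)):
    print(label, round(result.difference, 3),
          (round(result.ci_low, 3), round(result.ci_high, 3)),
          round(result.superiority_p, 3), result.equivalent)
```

In the hypothetical small trial, the conventional test is not significant even though the observed difference is 20 percentage points, and the interval extends well beyond the margin, so equivalence is not supported. In the hypothetical large trial, the interval lies entirely inside the margin.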
Yet, investigations that report clinical equivalence may not use this approach. Instead, after finding a negative result in conventional tests for statistical significance (for example, P > 0.05), investigators may declare that the entities compared are equivalent. Our current study was done to determine whether published claims of equivalence are supported by the methods used and the results obtained.

Methods

Study Sample

Using the National Library of Medicine Medical Subject Heading (MeSH) term "therapeutic equivalency" and the text word "equivalence," we did a structured MEDLINE search for English-language original research reports published from 1992 through 1996 in which equivalence was claimed. The MeSH term searching process identifies papers in which the National Library of Medicine has identified a main point in the text. The separate text word searching process indicates papers in which the selected word is used in the title or abstract. We used both of these search strategies to obtain a representative, but not exhaustive, sample of published reports claiming equivalence. Reports identified with the specific MeSH term "therapeutic equivalency" are more likely to be methodologically sound, and papers identified by the text word "equivalence" are more likely to state the aim of the study than papers retrieved with a more general word, such as "equivalent" or "equal."

We reviewed the obtained citations on-line to select, for further evaluation, those that appeared likely to describe original research claiming equivalence. We included randomized clinical trials as well as pertinent observational studies (for example, reference 9) because, although these types of research are conducted by using different methods, we believe that the quantitative claim of equivalence can be substantiated similarly.
After reviewing the title and, if necessary, the abstract, we discarded citations to papers, such as reviews, meta-analyses, or commentaries (for example, references 10 and 11), that did not report original data. We also excluded reports (for example, references 12 and 13) of purely laboratory or other nonhuman research, as well as those intended solely to show pharmacokinetic bioequivalence, such as generic drug applications to the U.S. Food and Drug Administration. Finally, we excluded papers that reported equivalence for anything other than patients' outcomes, such as a comparison of two radiation therapy regimens in which dose equivalence referred only to standardized radiation fractions (14).

For the remaining potentially eligible citations, abstracts were further reviewed according to the preceding criteria. Whenever the suitability of a paper was uncertain, the entire text was reviewed. All reviews were done in a structured fashion by one author using a database (Microsoft Access for Office 97, Microsoft, Inc., Redmond, Washington) created for this study. Although rules for inclusion and categorization were defined a priori, decisions were not straightforward for approximately 10% of reports. These difficult papers were reviewed by all three investigators for a consensus decision.

Evaluation of Papers

The entire text of each included paper was evaluated in a structured fashion for five prespecified attributes relating to the assessment and ultimate claim of equivalency. The attributes evaluated are listed below, along with the justification for their choice.

1. Statement of research aim. A stated research goal is needed to allow the investigators to choose the pertinent variables for study, boundaries for the magnitude of an equivalent result, and cogent analytic methods.

2. Magnitude of reported differences. Evaluating the clinical sensibility of an equivalence claim requires knowledge of exactly what is being called a negligibly small difference.
We therefore tabulated the actual values of the quantities that constituted the difference between the investigated groups. Although these quantities ranged broadly in units of measurement and extent of variability, differences can be standardized as an effect size (15), which is usually calculated as the direct increment between groups divided by the standard deviation in the control group. For example, if drug X is successful in 60% ± 30% (SD) of patients and placebo is successful in 40% ± 20% (SD), then the effect size is (60% - 40%)/20% = 1.0. The effect size can thus be considered a unit-free ratio of signal to noise.

3. Choice of quantitative boundary. We determined whether the investigators had set an a priori quantitative boundary for what would constitute equivalence in the reports examined. Demarcating a maximum value of "small," beyond which a difference could no longer be deemed equivalent, is needed for investigators and readers to appraise the numerical results in a clinical-scientific as well as a statistical manner. The boundary could be set from an absolute difference in means or medians, a proportionate difference in results, a ratio of two-group results, or an effect size. Although a single criterion does not exist for establishing what is "large," proportionate differences of 20% or greater between clinical groups have been suggested as potentially important (16). In addition, an effect size less than 0.20 has often been considered trivially small, 0.50 has been considered moderate, and 0.80 has been considered large (17). We are not advocating general use of these arbitrary thresholds, and neither the incremental difference between compared groups, the proportional difference, nor the effect size was used as a gold standard criterion for what would constitute an acceptable study for equivalence.
These thresholds can, however, serve to describe the magnitude of observed differences in our investigation, and this information is included for illustrative purposes.

4. Method of statistical (stochastic) testing. We next determined what, if any, testing was done to support the claim of equivalence, and we specifically checked whether the claim of equivalence was tested directly or was supported only by a failed test for superiority. In a direct test, the differences observed between groups or patients are compared against a specific equivalence boundary. A direct test of equivalency is intended to reject the alternative hypothesis that the true difference is larger than the boundary limit, whereas statistical tests for superiority are aimed at rejecting a null hypothesis of no difference.

5. Calculation of sample size. We expected that information about sample sizes, including advance calculations, would be reported. Such information would help, for example, to explain a paradox in which a large observed difference is deemed equivalent because the sample size was too small to achieve statistical significance. This problem of inadequate statistical power (which may cause studies to miss important observed differences) has been highlighted previously (18).

Quality or Impact of Journals

Although the quality of the journal might be considered when methodologic rigor is evaluated for published research, we did not try to define a "high-quality" medical journal. Instead, we determined which papers came from the 119 Abridged Index Medicus (AIM) journals that the National Library of Medicine regards as selected biomedical journal literature of immediate interest to the practicing physician (19). The National Library of Medicine states that journals are included in AIM according to the quality of the journal, usefulness of journal content for the professional, and the need for providing coverage in the fields of clinical medicine (19).
The list includes Annals of Internal Medicine, BMJ, JAMA, The Lancet, The New England Journal of Medicine, and many leading subspecialty journals but omits such nonclinical journals as Science, Nature, and Cell.

Role of the Funding Source

This study was funded by the Robert Wood Johnson Foundation through its Clinical Scholars Program. The funding source had no role in the collection, analysis, or interp

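The effect-size arithmetic described under "Magnitude of reported differences" can be reproduced with a minimal sketch. The function names are hypothetical. The cut points 0.20, 0.50, and 0.80 are the ones quoted in the text, but treating them as the edges of contiguous bands (including the intermediate "small" label) is an assumption made here for illustration, as is computing the proportionate difference relative to the control group.

```python
def effect_size(mean_treated, mean_control, sd_control):
    """Direct increment between groups divided by the control-group SD."""
    return (mean_treated - mean_control) / sd_control


def proportionate_difference(mean_treated, mean_control):
    """Increment expressed as a fraction of the control result (assumed definition)."""
    return (mean_treated - mean_control) / mean_control


def describe_effect_size(value):
    """Label an effect size using the quoted thresholds as band edges (assumption)."""
    size = abs(value)
    if size < 0.20:
        return "trivially small"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "moderate"
    return "large"


# Worked example from the text: drug X 60% (SD 30%), placebo 40% (SD 20%).
es = effect_size(60.0, 40.0, 20.0)
rel = proportionate_difference(60.0, 40.0)
print(es, describe_effect_size(es), rel)
```

With the figures from the worked example, the effect size is 1.0 and the proportionate difference is 0.5 (50%), both well above the illustrative thresholds, so a difference of this size would be hard to describe as negligibly small.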
[1]  Walter W. Hauck,et al.  Bioequivalence of generic and brand-name levothyroxine products in the treatment of hypothyroidism. , 1997, JAMA.

[2]  R. Makuch,et al.  Sample size requirements for evaluating a conservative therapy. , 1978, Cancer treatment reports.

[3]  D. Haines,et al.  Does a posterior aneurysm increase the risk of endocardial resection? , 1992, The Annals of thoracic surgery.

[4]  J. Groothuis,et al.  Safety and bioequivalency of three formulations of respiratory syncytial virus-enriched immunoglobulin , 1995, Antimicrobial agents and chemotherapy.

[5]  Moclobemide versus clomipramine in depressed patients in general practice. A randomized, double-blind, parallel, multicenter study. , 1995, Journal of clinical psychopharmacology.

[6]  N. Cutler,et al.  A comparison of circulating hormone levels in postmenopausal women receiving hormone replacement therapy. , 1992, American journal of obstetrics and gynecology.

[7]  D. Barckow,et al.  Cefepime versus cefotaxime in the treatment of lower respiratory tract infections. , 1993, The Journal of antimicrobial chemotherapy.

[8]  M. Haugh,et al.  A randomized comparison of the effect of four antihypertensive monotherapies on the subjective quality of life in previously untreated asymptomatic patients: field trial in general practice , 1995, Journal of Hypertension.

[9]  L. Collette,et al.  Larynx preservation in pyriform sinus cancer: preliminary results of a European Organization for Research and Treatment of Cancer phase III trial. EORTC Head and Neck Cancer Cooperative Group. , 1996, Journal of the National Cancer Institute.

[10]  A. R. Feinstein.  Zeta and delta: critical descriptive boundaries in statistical analysis. , 1998 .

[11]  J. Ware,et al.  Equivalence trials. , 1997, The New England journal of medicine.

[12]  P. P. Edgar,et al.  Functionally relevant gamma-aminobutyric acidA receptors: equivalence between receptor affinity (Kd) and potency (EC50)? , 1992, Molecular pharmacology.

[13]  R. Wilcox,et al.  Randomised, double-blind comparison of reteplase double-bolus administration with streptokinase in acute myocardial infarction (INJECT): trial to investigate equivalence International Joint Efficacy Comparison of Thrombolytics , 1995, The Lancet.

[14]  J A Lewis,et al.  Trials to assess equivalence: the importance of rigorous methods , 1996, BMJ.

[15]  J. Dean,et al.  Therapeutic bioequivalency study of brand name versus generic carbamazepine , 1992, Neurology.

[16]  Jacob Cohen.  Statistical Power Analysis , 1992 .

[17]  C. Dunnett,et al.  Significance testing to establish equivalence between treatments, with special reference to data in the form of 2X2 tables. , 1977, Biometrics.

[18]  J. Harenberg,et al.  Subcutaneous low-molecular-weight heparin versus standard heparin and the prevention of thromboembolism in medical inpatients. The Heparin Study in Internal Medicine Group. , 1996, Haemostasis.

[19]  A. Ruifrok,et al.  Comparison of continuous and pulsed low dose rate brachytherapy: biological equivalence in vivo. , 1994, International journal of radiation oncology, biology, physics.

[20]  R. Dahl,et al.  Assessing equivalence of inhaled drugs. , 1995, Respiratory medicine.

[21]  F. Werf.  A Comparison of Continuous Infusion of Alteplase with Double-Bolus Administration for Acute Myocardial Infarction , 1998 .

[22]  Lewis E. Kazis,et al.  Effect Sizes for Interpreting Changes in Health Status , 1989, Medical care.

[23]  G. Pasero,et al.  Deflazacort versus methylprednisolone in polymyalgia rheumatica: clinical equivalence and relative antiinflammatory potency of different treatment regimens. , 1995, The Journal of rheumatology.

[24]  Frans Van de Werf,et al.  An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. , 1993, The New England journal of medicine.

[25]  R. Hatala,et al.  Once-daily aminoglycoside dosing in immunocompetent adults: a meta-analysis. , 1996, Annals of internal medicine.

[26]  G. Leverger,et al.  Low-dose radiation therapy and reduced chemotherapy in childhood Hodgkin's disease: the experience of the French Society of Pediatric Oncology. , 1992, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[27]  T C Chalmers,et al.  The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 "negative" trials. , 1978, The New England journal of medicine.

[28]  G Barbash,et al.  Single-bolus tenecteplase compared with front-loaded alteplase in acute myocardial infarction: the ASSENT-2 double-blind randomised trial. , 1999, Lancet.

[29]  A. Woodcock,et al.  GR106642X: a new, non-ozone depleting propellant for inhalers , 1995, BMJ.

[30]  B. Lecoutre,et al.  Bayesian predictive approach for inference about proportions. , 1995, Statistics in medicine.

[31]  A. Feinstein.  XXXIV. The other side of ‘statistical significance’: alpha, beta, delta, and the calculation of sample size , 1975, Clinical pharmacology and therapeutics.

[32]  A. Spriet,et al.  When can 'non significantly different' treatments be considered as 'equivalent'? , 1979, British Journal of Clinical Pharmacology.

[33]  John O'quigley,et al.  General Approaches to the Problem of Bioequivalence , 1988 .

[34]  D G Altman,et al.  Absence of evidence is not evidence of absence. , 1996, Australian veterinary journal.

[35]  D. Moulin,et al.  Comparative Efficacy and Safety of Controlled‐Release Morphine Suppositories and Tablets in Cancer Pain , 1998, Journal of clinical pharmacology.

[36]  L. Kofoed,et al.  Therapeutic interchange of fluoxetine and sertraline: experience in the clinical setting. , 1994, American journal of hospital pharmacy.