Health technology assessments and guidelines require the synthesis of evidence on several treatments of interest from several studies. Typically, such analyses are performed by using network meta-analysis (NMA), which provides a consistent set of treatment effect estimates so that coherent recommendations may be made (1-3). However, if the NMA estimates are imprecise, if studies included in the analysis have flaws in their conduct or reporting, or if concerns exist regarding relevance, the reliability of the NMA results may be in doubt. Therefore, analysts and decision makers need to assess the robustness of any conclusions based on the NMA to potential limitations in the included evidence. The framework developed by the GRADE (Grading of Recommendations Assessment, Development and Evaluation) Working Group, known as GRADE NMA (4, 5), has been proposed to address this task. A GRADE assessment rates the quality of evidence contributing to the treatment effect estimates for each pair of treatments as high, moderate, low, or very low across 5 domains (study limitations, imprecision, indirectness, inconsistency [heterogeneity], and publication bias), and a qualitative summary judgment is formed (6). The GRADE handbook (7) states 2 different aims for this quality assessment, depending on whether the intended users are systematic reviewers or guideline developers. For systematic reviewers, the quality of evidence reflects "the extent to which we are confident that an estimate of the effect is correct." For guideline developers, the quality of evidence reflects "the extent to which our confidence in an estimate of the effect is adequate to support a particular recommendation." GRADE NMA reaches a judgment for each treatment comparison by considering the individual GRADE judgments for the direct and indirect evidence between each pair of treatments. However, this approach does not provide guideline developers with an assessment of the credibility of recommendations based on the NMA.
Instead, it delivers a set of independent assessments of the confidence in the estimates for individual pairwise comparisons. Moreover, GRADE NMA suggests replacing the NMA estimates with the direct or indirect estimates if they have a higher quality rating, leading to a set of final estimates that are not consistent with each other and therefore cannot be used for rational decision making (4). For example, it would be possible to obtain estimates in which intervention A is better than B, B is better than C, but C is better than A. Not only is it possible for GRADE NMA to reach a set of conclusions that are logically incoherent, but it also fails to indicate how evidence quality might affect the final recommendation. As such, although GRADE NMA may achieve the prescribed aim for systematic reviewers, it is inadequate for guideline developers. The GRADE NMA ratings describe how likely each comparison estimate is to differ from the truth, but the influence of the evidence on the recommendation is not considered. For example, low-quality evidence that has negligible influence on the treatment recommendation should be of little concern, but more influential evidence should be scrutinized carefully, and confidence in the robustness of the recommendation may be diminished. Recent advances in GRADE guidance acknowledge that the influence of evidence is important, suggesting that there is no need to rate the indirect evidence when "the [quality] of the direct evidence is high, and the contribution of the direct evidence to the network estimate is at least as great as that of the indirect evidence" (5). However, this reasoning is applied only to each pairwise comparison, and influence on the overall decision is not considered. Furthermore, GRADE NMA quickly becomes infeasible as the number of treatments in the network increases, because the number of loops of evidence that must be assessed grows very large (as in the example of social anxiety disorder discussed later).
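The intransitivity problem can be made concrete with a small numerical sketch (all values below are invented for illustration): if the "use the higher-quality source" rule keeps the direct estimates for A versus B and B versus C but the indirect estimate for A versus C, the three retained estimates need not agree with any single consistent set.

```python
# Hypothetical estimates retained after per-comparison source selection
# (values invented for illustration; positive favours the first-named
# treatment). Because each comparison keeps a different source, the three
# numbers need not be mutually consistent.
selected = {
    ("A", "B"): 0.20,   # direct estimate kept for A vs B
    ("B", "C"): 0.15,   # direct estimate kept for B vs C
    ("A", "C"): -0.10,  # indirect estimate kept for A vs C
}

def beats(x, y):
    """True if x is preferred to y under the retained estimates."""
    if (x, y) in selected:
        return selected[(x, y)] > 0
    return selected[(y, x)] < 0

# Each pairwise conclusion looks fine in isolation, yet together they cycle:
print(beats("A", "B"))  # A better than B
print(beats("B", "C"))  # B better than C
print(beats("C", "A"))  # C better than A -> A > B > C > A
```

No single set of treatment rankings can satisfy all three conclusions, which is why such mixed estimates cannot support a coherent recommendation.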
An alternative, statistically rigorous extension of GRADE to NMA proposed by Salanti and colleagues (8) formally evaluates the influence of the direct evidence on each estimate and uses this to combine quality judgments from each piece of direct evidence into an overall quality assessment. This approach avoids the possibility of incoherent conclusions. Salanti GRADE has been implemented in the CINeMA (Confidence in Network Meta-analysis) Web application (9), which automates the statistical operations and facilitates the required judgment steps, making it feasible even in large networks with many evidence loops, because the quality and contribution of indirect evidence are accounted for automatically. Salanti GRADE clearly meets the aim of quality assessment for systematic reviews and does so in a much more rigorous manner than GRADE NMA. However, it still does not fully meet the aim of GRADE for guideline developers, because the quality assessments reflect the confidence in the NMA estimates, which does not necessarily translate into robustness of treatment recommendations: Evidence may be influential for an NMA result but may not actually change a decision (10). In addition, it does not detail how potential bias would change a recommendation and therefore is less useful to decision makers and guideline developers than the approach described here, which directly assesses the robustness of the recommendations based on an NMA. Network meta-analyses are based on data from studies of relative treatment effects. Both the study estimates and resulting NMA estimates may differ from the true effects of interest in the decision setting for 2 basic reasons: bias (systematic error) and sampling variation (random error). In the most general statistical sense, bias is any systematic departure from the truth. This may be a result of issues of internal validity (that is, study limitations) or external validity (affecting the generalizability of results into the decision setting). 
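The key ingredient of this style of assessment is a contribution matrix: how much each piece of direct evidence contributes to each network estimate. The sketch below is a simplified illustration of that idea, not CINeMA's actual algorithm, and all contributions and judgments are invented.

```python
# Illustrative sketch (not CINeMA's exact procedure): combine per-comparison
# risk-of-bias judgments into an overall level for each network estimate,
# weighting by each direct comparison's contribution. All numbers invented.

contribution = {  # rows sum to 1: share contributed by each direct comparison
    "A vs B": {"A vs B": 0.70, "A vs C": 0.15, "B vs C": 0.15},
    "A vs C": {"A vs B": 0.10, "A vs C": 0.80, "B vs C": 0.10},
}
bias_level = {"A vs B": 0, "A vs C": 2, "B vs C": 1}  # 0=low, 1=some, 2=high

def weighted_bias(network_estimate):
    """Contribution-weighted average bias level for one network estimate."""
    weights = contribution[network_estimate]
    return sum(w * bias_level[source] for source, w in weights.items())

for estimate in contribution:
    print(estimate, round(weighted_bias(estimate), 2))
```

The A-versus-C network estimate inherits a much higher weighted bias level because its dominant source of evidence is at high risk of bias, whereas low-quality evidence with a small contribution barely moves the result.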
Sampling variation is captured by the CI, representing the uncertainty in the estimate, and typically is reduced as sample size increases. The issues addressed by the 5 GRADE domains all concern either bias (study limitations, inconsistency, indirectness, publication bias) or sampling variation (imprecision). From here on, we use the phrase "change in the evidence" to refer to any difference between an estimate and the true effect of interest, whether it is the result of bias or of sampling variation. Treatment recommendations may be made by using various decision criteria. The simplest is to choose the treatment with the best estimate on a particular outcome, such as the treatment with the highest mean reduction in pain, or on a composite outcome, such as a weighted average of outcomes (as in multicriteria decision making [11]) or net monetary benefit (12). Other formats include recommending the top few treatments with the highest mean estimates, treatments achieving a benefit above a certain cut point, the top treatments within a minimal clinically important difference, or a "do not do" recommendation against using the treatments with the worst outcomes. To determine the robustness of a treatment recommendation, we are concerned with whether there are plausible changes to the evidence that would translate into NMA estimates that lead to a different recommendation being reached. Threshold analysis is a standard form of sensitivity analysis used in health economics. It answers the question: How much would the evidence have to change before the recommendation changes? (13). In its basic form, we can simply rerun the NMA, iteratively changing the data until a new recommendation is reached (10).
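The basic "rerun until the recommendation flips" procedure can be sketched in a toy three-treatment network (all treatment names and numbers below are invented, and the "NMA" is reduced to a trivial lookup so the iteration itself is visible):

```python
# Minimal sketch of basic threshold analysis by iterative reruns, in a toy
# network where B and C are each compared directly with A (numbers invented).
# Higher effect versus A is better; the recommendation is the best treatment.

d_AB = 0.50  # pooled estimate, B vs A
d_AC = 0.60  # pooled estimate, C vs A

def recommendation(shift_on_AB):
    """Recommended treatment after shifting the B-vs-A evidence by 'shift_on_AB'."""
    effects = {"A": 0.0, "B": d_AB + shift_on_AB, "C": d_AC}
    return max(effects, key=effects.get)

# Iteratively increase the adjustment until the recommendation changes:
base = recommendation(0.0)  # 'C' under the original data
delta, step = 0.0, 0.01
while recommendation(delta) == base:
    delta += step
print(f"B-vs-A evidence must increase by about {delta:.2f} "
      f"before the recommendation changes from {base} to {recommendation(delta)}")
```

In a real NMA each "rerun" refits the full model, so this brute-force search is slow; the algebraic approach described below the examples sidesteps the refitting entirely.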
These changes to the data may be made at 1 of 2 levels: either by changing an estimate from a single study (which we refer to as a study-level threshold analysis) or by changing the combined evidence on a contrast (relative effect) between 2 treatments (a contrast-level threshold analysis). The result is a set of thresholds that describe how much each (study or contrast) data point could change before the recommendation changes, and what the revised recommendation would be. Investigators then can judge whether the evidence might plausibly change by more than the threshold amount in each direction to determine the robustness of the recommendation. For potential changes due to sampling variation, one may refer to the CI (or credible interval [CrI] if a Bayesian analysis was used) and whether it overlaps the threshold. For potential changes due to bias, a judgment of the plausible magnitude and direction of potential bias is required. If it is judged that the evidence could not plausibly change beyond the thresholds, then the recommendation is considered robust; otherwise, the recommendation is sensitive to plausible changes in the evidence. Recently, a more sophisticated algebraic approach was proposed that does not require several reruns of the NMA (14); it requires only that the user supply the NMA estimates and the decision criteria. This method is computationally much faster and offers additional flexibility: We can consider potential changes to individual study estimates or to a set of estimates on a treatment comparison and examine the impact of specific potential biases. The analysis is not limited to greatest-efficacy decisions: We can consider how changes in the evidence affect any treatment rankings (for example, to determine the robustness of a "do not do" decision for the worst treatment), and we can consider complex decision rules, such as those based on a minimal clinically important difference, and simple net benefit functions.
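For sampling variation, the check described above reduces to comparing each credible interval with its thresholds. A hedged sketch (all intervals and thresholds invented for illustration):

```python
# Sketch of the robustness check for sampling variation: a recommendation is
# robust on a contrast if no value inside its 95% CrI shifts the evidence
# beyond either threshold. All numbers below are invented for illustration.

evidence = {
    # contrast: (estimate, CrI lower, CrI upper, negative threshold, positive threshold)
    "B vs A": (0.50, 0.38, 0.58, -0.40, 0.10),
    "C vs A": (0.60, 0.20, 1.00, -0.10, 0.55),
}

def robust(contrast):
    est, lo, hi, t_neg, t_pos = evidence[contrast]
    # The CrI describes plausible departures from the estimate; the decision
    # is robust if neither end of the interval crosses a threshold.
    return (lo - est) > t_neg and (hi - est) < t_pos

for contrast in evidence:
    print(contrast, "robust" if robust(contrast) else "sensitive")
```

Here the B-versus-A evidence cannot plausibly move past either threshold, but the wide C-versus-A interval crosses its negative threshold, so the recommendation is flagged as sensitive to that contrast; a judgment about plausible bias would use the same thresholds with expert-elicited magnitudes in place of the interval.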
An R (The R Foundation) package is available that makes the analysis quick and easy to conduct (https://cran.r-project.org/package=nmathresh) (14, 15).

Threshold Analysis in Practice

We illustrate threshold analysis in 2 practical examples taken from clinical guidelines produced by the National Institute for Health and Care Excellence (NICE).
[1] A. A. Stinnett, et al. Net Health Benefits. Medical Decision Making, 1998.
[2] Gordon H. Guyatt, et al. Advances in the GRADE approach to rate the certainty in estimates from a network meta-analysis. Journal of Clinical Epidemiology, 2018.
[3] R Core Team. R: A language and environment for statistical computing. 2014.
[4] Anna Chaimani, et al. Evaluating the Quality of Evidence from a Network Meta-Analysis. PLoS One, 2014.
[5] Nicola J. Cooper, et al. Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ, 2009.
[6] Mohammad Hassan Murad, et al. A GRADE Working Group approach for rating the quality of treatment effect estimates from network meta-analysis. BMJ, 2014.
[7] Isabelle Boutron, et al. A revised tool for assessing risk of bias in randomized trials. 2016.
[8] Gordon H. Guyatt, et al. GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 1: Introduction. BMJ, 2016.
[9] S. Norris, et al. GRADE Methods for Guideline Development: Time to Evolve? Annals of Internal Medicine, 2016.
[10] Douglas G. Altman, et al. Models for potentially biased evidence in meta-analysis using empirically based priors. 2009.
[11] Nicky J. Welton, et al. Estimation and adjustment of bias in randomized evidence by using mixed treatment comparison meta-analysis. 2010.
[12] Nicky J. Welton, et al. Effects of study precision and risk of bias in networks of interventions: a network meta-epidemiological study. International Journal of Epidemiology, 2013.
[13] David J. Spiegelhalter, et al. Bias modelling in evidence synthesis. Journal of the Royal Statistical Society, Series A, 2009.
[14] Howard Balshem, et al. GRADE guidelines: 3. Rating the quality of evidence. Journal of Clinical Epidemiology, 2011.
[15] E. Mayo-Wilson, et al. Psychological and pharmacological interventions for social anxiety disorder in adults: a systematic review and network meta-analysis. The Lancet Psychiatry, 2014.
[16] H. Schünemann, et al. [GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 1: Introduction.] Recenti Progressi in Medicina, 2017.
[17] Sofia Dias, et al. A threshold analysis assessed the credibility of conclusions from network meta-analysis. Journal of Clinical Epidemiology, 2016.
[18] Deborah M. Caldwell, et al. Simultaneous comparison of multiple treatments: combining direct and indirect evidence. BMJ, 2005.
[19] G. Guyatt, et al. [GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 2: Clinical practice guidelines.] Gaceta Sanitaria, 2018.
[20] J. Sterne, et al. The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 2011.
[21] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. 1969.
[22] Ethan M. Balk, et al. Influence of Reported Study Design Characteristics on Intervention Effect Estimates From Randomized, Controlled Trials. Annals of Internal Medicine, 2012.
[23] H. Schünemann, et al. [GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 2: Clinical practice guidelines.] Recenti Progressi in Medicina, 2018.
[24] G. Lu, et al. Assessing Evidence Inconsistency in Mixed Treatment Comparisons. 2006.
[25] Annett Wechsler, et al. Applied Methods of Cost Effectiveness Analysis in Healthcare. 2016.
[26] Alejandra Duenas, et al. Multiple criteria decision analysis for health technology assessment. Value in Health, 2012.
[27] Sofia Dias, et al. Sensitivity of treatment recommendations to bias in network meta-analysis. Journal of the Royal Statistical Society, Series A, 2018.
[28] T. Lumley. Network meta-analysis for indirect treatment comparisons. Statistics in Medicine, 2002.