To the Editor:
In a recent article, Schork and Greenwood (2004) made the alarming claim that nonparametric linkage analysis methods have a previously unrecognized inherent bias against detection of linkage and proposed that linkage studies that have used these methods should be reexamined. It is fortunate for the genetics community that this claim is not well founded. The “bias” discussed by Schork and Greenwood is simply conservative handling of incomplete information. This issue is well appreciated by statistical geneticists, and most nonparametric linkage analysis methods—as implemented in commonly used programs such as GeneHunter (Kruglyak et al. 1996), Merlin (Abecasis et al. 2002), and many other software packages—already handle incomplete information correctly (see Cordell [2004]). The examples to the contrary provided by Schork and Greenwood (2004) derive from a contrived statistic explicitly implemented by these authors to handle incomplete information incorrectly.
This is best illustrated with Schork and Greenwood’s (2004) example of testing whether a coin is fair. They write that if a coin is tossed 100 times, but the outcomes of only 50 tosses are observed, and 40 of these come up heads, then the estimate of the probability of heads is, of course, 0.80. They then write that if the 50 unobserved tosses are assigned a 25-25 split expected of a fair coin, then the overall estimate of the probability of heads would be 0.65, which underestimates the true probability of heads and leads to a bias against detection of an unfair coin. This is, of course, true, and, for that very reason, no sound statistical procedure assigns a 25-25 split to the unobserved events. Rather, all correct missing-data–estimation procedures appropriately compute the probability of heads to be 0.80 in this example. Schork and Greenwood’s statistic, unlike real-world linkage statistics, implements the equivalent of the former (incorrect) procedure when faced with incomplete data (i.e., uninformative markers or evaluation of linkage between marker locations).
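The arithmetic of the coin example can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, the unobserved tosses are assumed to be missing completely at random, and the EM loop is a generic textbook version rather than code from any linkage program. In this toy setting, the 25-25 assignment coincides with the first EM iteration started from the null value of 0.5, whereas iterating to convergence recovers the observed-data estimate.

```python
def naive_null_imputation(heads_observed, n_observed, n_missing, p_null=0.5):
    """Fill unobserved tosses with the split expected under the null, then estimate."""
    imputed_heads = n_missing * p_null
    return (heads_observed + imputed_heads) / (n_observed + n_missing)


def em_estimate(heads_observed, n_observed, n_missing,
                p_start=0.5, tol=1e-12, max_iter=1000):
    """EM estimate of P(heads) when some tosses are missing completely at random.

    Returns the final estimate and the list of estimates after each iteration.
    """
    p = p_start
    trace = []
    for _ in range(max_iter):
        # E-step: expected number of heads among the unobserved tosses,
        # computed under the current estimate (not under the null).
        expected_missing_heads = n_missing * p
        # M-step: re-estimate from observed plus expected counts.
        p_new = (heads_observed + expected_missing_heads) / (n_observed + n_missing)
        trace.append(p_new)
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p, trace


# Numbers from the example: 100 tosses, 50 observed, 40 of those heads.
naive = naive_null_imputation(40, 50, 50)
em, trace = em_estimate(40, 50, 50)
print(f"null imputation:      {naive:.3f}")
print(f"first EM iteration:   {trace[0]:.3f}")
print(f"EM at convergence:    {em:.3f}")
```

The update is p_new = 0.4 + 0.5 p, whose only fixed point is p = 0.8, so the unobserved tosses do not move the final estimate away from the observed proportion.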
The method directly examined by Schork and Greenwood (2004) is based on the popular maximum LOD score (MLS) approach introduced by Risch (1990). In this approach, the fraction of alleles that are shared identical by descent (IBD) by affected pairs of relatives (the quantity represented by the probability of heads in the coin-toss analogy) is estimated by maximum likelihood, and significance is evaluated via a likelihood-ratio test. The expectation-maximization (EM) algorithm (Dempster et al. 1977) is most commonly used to account for incomplete specification of IBD sharing by the data. The EM algorithm, as originally described (Dempster et al. 1977) and when correctly implemented (e.g., by Kruglyak and Lander [1995]), computes the IBD-sharing estimates iteratively, using standard missing-data techniques to update the “imputed values” at each iteration, and provides an accurate and unbiased estimate of the fraction of alleles shared IBD (and the LOD score) at the final iteration (see Cordell [2004]).
The statistic used by Schork and Greenwood (2004) is superficially similar, but, unlike any statistical analysis in the widely used linkage-analysis programs, does not use EM but rather simply assigns to uninformative pairs the sharing fraction expected under the null hypothesis of no linkage, making no attempt to properly estimate the sharing for uninformative data under the alternative hypothesis of linkage. Although the authors do not describe in detail how they implemented the method, their equation (1) (as well as their definition of maximum-likelihood estimates for the IBD-sharing parameters) applies only to the case of fully informative pairs and is inappropriate for other cases. The appropriate formulation is clearly stated in the article by Risch (1990) that originally described the method, as well as in Kruglyak and Lander (1995).
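The contrast between the two treatments of uninformative pairs can be sketched in Python for affected sib pairs. Everything here is an assumption made for illustration: the toy data, the function names, and in particular `null_averaging`, which is only one plausible reading of "assigning the null sharing fraction to uninformative pairs" and is not Schork and Greenwood's actual implementation. The likelihood uses the form L(z) = prod_i sum_j z_j w_ij, with w_ij proportional to the probability of the marker data for pair i given that the pair shares j alleles IBD. The sketch also ignores any genetic constraints on the sharing vector that real programs may impose.

```python
import math

NULL_SHARING = (0.25, 0.5, 0.25)  # P(IBD = 0, 1, 2) for sib pairs under no linkage


def lod(z, pair_probs):
    """log10 likelihood ratio of sharing vector z against the null sharing vector."""
    total = 0.0
    for f in pair_probs:
        # f[j] is P(IBD = j | marker data) computed under the null prior, so
        # f[j] / NULL_SHARING[j] is proportional to P(marker data | IBD = j).
        total += math.log10(sum(z[j] * f[j] / NULL_SHARING[j] for j in range(3)))
    return total


def em_sharing(pair_probs, n_iter=500):
    """Maximum-likelihood sharing vector by EM (unconstrained toy version)."""
    z = list(NULL_SHARING)
    for _ in range(n_iter):
        counts = [0.0, 0.0, 0.0]
        for f in pair_probs:
            # E-step: posterior IBD distribution for this pair under the current z.
            weights = [z[j] * f[j] / NULL_SHARING[j] for j in range(3)]
            norm = sum(weights)
            for j in range(3):
                counts[j] += weights[j] / norm
        # M-step: average the posterior distributions over all pairs.
        z = [c / len(pair_probs) for c in counts]
    return z


def null_averaging(pair_probs):
    """Hypothetical shortcut: average the null-prior IBD probabilities directly."""
    n = len(pair_probs)
    return [sum(f[j] for f in pair_probs) / n for j in range(3)]


# Toy data: 20 fully informative pairs plus 20 completely uninformative pairs.
informative = [(1.0, 0.0, 0.0)] * 2 + [(0.0, 1.0, 0.0)] * 8 + [(0.0, 0.0, 1.0)] * 10
uninformative = [NULL_SHARING] * 20
all_pairs = informative + uninformative

z_inf = em_sharing(informative)
z_all = em_sharing(all_pairs)
z_naive = null_averaging(all_pairs)

print("EM, informative pairs only:", [round(x, 3) for x in z_inf], round(lod(z_inf, informative), 3))
print("EM, all pairs:             ", [round(x, 3) for x in z_all], round(lod(z_all, all_pairs), 3))
print("null averaging, all pairs: ", [round(x, 3) for x in z_naive], round(lod(z_naive, all_pairs), 3))
```

In this toy data set, a completely uninformative pair contributes a factor of 1 to the likelihood ratio for any z, so the EM estimate and LOD score are the same with or without those pairs, while the averaged estimate is pulled toward the null values and gives a lower LOD score.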
It is important to note that, although we have focused on the case of the MLS approach and the EM algorithm, appropriate handling of incomplete information has been a key consideration in the design and implementation of other nonparametric linkage methods. For example, the problem of incomplete information in quantitative-trait analysis was explicitly addressed nearly a decade ago for sib pairs (Kruglyak and Lander 1995) and, more recently, for larger pedigrees (e.g., Sham et al. 2002), although several methods still in use today have not fully accounted for this issue, and users should be cognizant of this fact (Cordell 2004). Also, although nonparametric linkage (NPL) analysis has always been recognized to be conservative when the data are not fully informative (Kruglyak et al. 1996), this problem has long been resolved either by calculating LOD scores (Kong and Cox 1997) or by estimating significance empirically through simulation (e.g., Kruglyak and Daly 1998), an approach that is becoming increasingly practical even for whole-genome scans. Other methods are examined in detail by Cordell (2004), who comes to similar conclusions. Of course, it is well appreciated that all linkage methods (and all statistical tests, in general) have lower power when faced with less informative data, but this broadly recognized effect is distinct from the “bias” claimed by Schork and Greenwood.
Schork and Greenwood (2004) also make a problematic statement about parametric linkage analysis. They correctly note that the contribution to the LOD score of completely uninformative families is zero—exactly the same as when such families are simply excluded from analysis—but then inexplicably conclude that “uninformative families detract from a linkage signal in parametric settings as well” (Schork and Greenwood 2004, p. 312). Since the final statistic in parametric analysis is simply the sum of individual family LOD scores, uninformative families, obviously, have absolutely no effect on the overall results.
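The additivity argument can be illustrated with a minimal Python sketch. It assumes a deliberately simplified model that is not taken from either article: each family is summarized by a count of phase-known informative meioses and the number of recombinants among them, so that the family LOD score at recombination fraction theta is the log10 of theta^k (1 - theta)^(n - k) / 0.5^n. The function names and family counts are hypothetical.

```python
import math


def family_lod(n_meioses, n_recombinants, theta):
    """LOD score for one family with phase-known meioses: log10 L(theta) / L(0.5)."""
    n, k = n_meioses, n_recombinants
    return (k * math.log10(theta)
            + (n - k) * math.log10(1.0 - theta)
            - n * math.log10(0.5))


def total_lod(families, theta):
    """Overall LOD score: the sum of the individual family LOD scores."""
    return sum(family_lod(n, k, theta) for n, k in families)


# (informative meioses, recombinants) per family; values are made up.
informative = [(10, 1), (8, 0), (12, 2)]
# A completely uninformative family has no informative meioses at all.
uninformative = [(0, 0)] * 5

thetas = [i / 100 for i in range(1, 51)]  # 0.01 to 0.50
curve_without = [total_lod(informative, t) for t in thetas]
curve_with = [total_lod(informative + uninformative, t) for t in thetas]

best = max(range(len(thetas)), key=lambda i: curve_with[i])
print(f"max LOD with uninformative families:    {curve_with[best]:.3f} at theta = {thetas[best]:.2f}")
print(f"max LOD without uninformative families: {max(curve_without):.3f}")
```

Because each uninformative family contributes exactly zero at every value of theta, the two LOD curves coincide, and so do the maximum LOD score and the estimate of theta.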
In conclusion, the “bias” in linkage analysis claimed by Schork and Greenwood does not affect most modern nonparametric (and parametric) linkage analysis methods. The handling of incomplete information remains an active area of research in some specialized linkage settings.
[1] E. Lander, et al. Complete multipoint sib-pair analysis of qualitative and quantitative traits. American journal of human genetics, 1995.
[2] G. Abecasis, et al. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics, 2002.
[3] M. Daly, et al. Linkage thresholds for two-stage genome scans. American journal of human genetics, 1998.
[4] N J Cox, et al. Allele-sharing models: LOD scores and accurate linkage tests. American journal of human genetics, 1997.
[5] L Kruglyak, et al. Parametric and nonparametric linkage analysis: a unified multipoint approach. American journal of human genetics, 1996.
[6] N. Risch, et al. Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. American journal of human genetics, 1990.
[7] Douglas W. Smith, et al. Both rare and common polymorphisms contribute functional variation at CHGA, a regulator of catecholamine physiology. American journal of human genetics, 2004.
[8] Shaun Purcell, et al. Powerful regression-based quantitative-trait linkage analysis of general pedigrees. American journal of human genetics, 2002.
[9] N. Schork, et al. Inherent bias toward the null hypothesis in conventional multipoint nonparametric linkage analysis. American journal of human genetics, 2004.
[10] Heather J Cordell, et al. Bias toward the null hypothesis in model-free linkage analysis is highly dependent on the test statistic used. American journal of human genetics, 2004.