Journal of Heredity Advance Access originally published online on December 23, 2004
Journal of Heredity 2005 96(2):85-88; doi:10.1093/jhered/esi017
On the Estimation of Genome-wide Heterozygosity Using Molecular Markers
From the Department of Forestry and Natural Resources, Purdue University, West Lafayette, IN 47907-1159
Address correspondence to Andrew DeWoody at the address above, or e-mail: dewoody{at}purdue.edu.
| Abstract |
|---|
Coltman and Slate (2003) recently performed a meta-analysis on studies that investigated the association between genetic variation at microsatellite loci and phenotypic trait variation. One factor not explicitly addressed in their meta-analysis is the actual estimation of genome-wide heterozygosity via molecular markers. Many authors still associate marker-estimated heterozygosity with genome-wide heterozygosity, despite allozyme-based evidence that such correlations are usually very weak or nonexistent. Here, we show that genome-wide heterozygosity is poorly estimated not only by allozymes but also by microsatellite loci and by single-nucleotide polymorphisms (SNPs). Thus, associations between fitness (or other phenotypes) and heterozygosity should be established firmly on causative factors and not on simple correlations.
Correlations between evolutionary fitness and zygosity at marker loci have been documented in a wide variety of organisms (Coltman and Slate 2003). In general, the idea of a heterozygote advantage (i.e., overdominance) has received considerable support (Mitton 1997). Specifically, heterozygosity-fitness correlations fall into one of three primary categories (David 1998; Hansson and Westerberg 2002): (1) the "direct effect" hypothesis claims that a heterozygote advantage is specifically due to the assayed loci (e.g., enzyme polymorphisms); (2) the "local effect" hypothesis claims that marker loci are closely linked to fitness loci; and (3) the "general effect" hypothesis claims that a heterozygote advantage is conveyed not by the scored loci (or tightly linked loci) but by genome-wide effects. Here, the primary focus is on heterozygosity-fitness correlations (HFCs) due to the general effect, that is, genomic zygosity.
The mean level of individual heterozygosity across all loci in the genome is a parameter, H, that can be estimated with a suite of molecular markers. For example, heterozygosity can be measured at each of 20 allozymes, and the mean heterozygosity across these 20 loci can be represented as h. Molecular markers can be used to search for genomic heterozygosity-fitness correlations (HFCs) if h provides a robust estimate of H.
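As a minimal sketch of how the statistic h is obtained for one individual, the following Python function counts the fraction of scored marker loci that are heterozygous. The function name, the genotype encoding (a pair of allele labels per locus), and the example individual are illustrative assumptions, not part of the original study.

```python
def individual_heterozygosity(genotypes):
    """Return h, the fraction of scored loci at which an individual is heterozygous.

    `genotypes` is a list of (allele_a, allele_b) tuples, one per marker locus.
    This encoding is a hypothetical convenience for the example.
    """
    if not genotypes:
        raise ValueError("at least one scored locus is required")
    heterozygous = sum(1 for allele_a, allele_b in genotypes if allele_a != allele_b)
    return heterozygous / len(genotypes)


# Hypothetical individual scored at 20 allozyme loci:
# 3 loci are heterozygous (F/S) and 17 are homozygous (F/F).
example_genotypes = [("F", "S")] * 3 + [("F", "F")] * 17
h = individual_heterozygosity(example_genotypes)
print(h)  # 3 heterozygous loci out of 20
```

Whether this h says anything about H depends entirely on how well 20 loci represent the rest of the genome, which is the question addressed in the remainder of the text.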
Genomic (i.e., genome-wide) HFCs are often reported in the literature, and this is somewhat surprising, not necessarily because heterozygosity and fitness are unrelated, but because of known problems in estimating H using only a few molecular markers. Twenty-five years ago, Mitton and Pierce (1980) used computer simulations to show that correlations between H and individual heterozygosity as estimated by molecular markers (h) are generally quite low. Shortly thereafter, Chakraborty (1981) provided an analytical formula to calculate the expected correlation between the parameter H and statistic h; he too found that genome-wide heterozygosity is poorly estimated using a suite of fewer than 20 independent loci.
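The flavor of such a simulation can be conveyed with a deliberately simplified toy model. The sketch below is not a reconstruction of the Mitton and Pierce (1980) simulations or of Chakraborty's mutation-drift model. It assumes that every locus is independent and heterozygous with the same probability, and it treats the first few loci as the assayed markers. Under those assumptions the expected correlation between h and H equals the square root of the fraction of loci sampled. All names and parameter values are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def simulate_h_vs_H(n_loci, r_markers, n_individuals, p_het, seed=1):
    """Correlation between marker heterozygosity (h) and genome-wide heterozygosity (H)."""
    rng = random.Random(seed)
    h_values, H_values = [], []
    for _ in range(n_individuals):
        # 1 = heterozygous, 0 = homozygous; loci are independent in this toy model.
        loci = [1 if rng.random() < p_het else 0 for _ in range(n_loci)]
        h_values.append(sum(loci[:r_markers]) / r_markers)  # first r loci act as markers
        H_values.append(sum(loci) / n_loci)                 # all n loci define H
    return pearson(h_values, H_values)


corr = simulate_h_vs_H(n_loci=500, r_markers=20, n_individuals=2000, p_het=0.3)
expected = math.sqrt(20 / 500)  # sqrt(r/n) under the toy model's assumptions
print(round(corr, 3), round(expected, 3))
```

Even in this idealized setting, sampling 4% of the loci yields a correlation of only about 0.2, which illustrates why a handful of markers says little about the genome as a whole.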
Although the original work of Mitton and Pierce (1980) and Chakraborty (1981) was based on conventional protein markers such as allozymes, their results are applicable to all kinds of molecular markers. Assuming mutation-drift equilibrium, Chakraborty (1981) showed that the expected correlation (designated $\rho$) between H and h can be calculated as:
![]() | (1) |
Here, r is the number of marker loci sampled and n is the total number of loci in the genome. How does $\rho$ respond to changes in h and/or r? Clearly, these variables differ among marker systems. Below, we compare microsatellites and single-nucleotide polymorphisms to allozymes. Mean heterozygosity in vertebrates is an order of magnitude higher at microsatellite loci than at allozyme loci (DeWoody and Avise 2000). Unfortunately, correlations between H and h actually decline slightly as heterozygosity increases (Figure 1). For example, a sample of 20 homozygous markers (i.e., mean h = 0.0) from a genome consisting of 50,000 genes produces an expected correlation between H and h of 0.0245, whereas 20 heterozygous markers (i.e., mean h = 1.0) produce an expected correlation of 0.0200 (Chakraborty 1981). This means that, on average, 20 allozyme markers will give marginally better estimates of H than will 20 microsatellite markers. Thus, genome-wide HFCs may be slightly stronger (albeit still generally tenuous) in allozyme studies than in microsatellite studies.
In theory, the correlation ($\rho$) between h and H depends on n, r, and h; but in practice, $\rho$ is determined primarily by effort (r/n). It can be shown analytically that $\rho$ is a decreasing function of heterozygosity; $\rho$ is largest when h = 0 and smallest when h = 1 (Figure 1). That is, the minimal correlation between h and H occurs when h = 1:
![]() | (2) |
![]() | (3) |
$\rho$ remains tightly constrained by effort:
![]() |
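The two worked values quoted earlier (20 markers drawn from 50,000 genes) can be checked numerically. The closed forms used below, sqrt(r/n) when h = 1 and sqrt(1.5 r/n) when h = 0, are an inference from those quoted numbers and should be treated as an assumption. They are not a transcription of the published equations, which may contain additional terms that vanish when r is much smaller than n. The function names are hypothetical.

```python
import math


def rho_all_heterozygous(r, n):
    # Assumed limiting form at h = 1 (inferred from the quoted value 0.0200).
    return math.sqrt(r / n)


def rho_all_homozygous(r, n):
    # Assumed limiting form at h = 0 (inferred from the quoted value 0.0245).
    return math.sqrt(1.5 * r / n)


r, n = 20, 50_000  # markers sampled, loci in the genome (values quoted in the text)
low = rho_all_heterozygous(r, n)
high = rho_all_homozygous(r, n)
print(f"{low:.4f} <= rho <= {high:.4f}")  # 0.0200 <= rho <= 0.0245
```

Under this reading, the whole range of marker heterozygosity moves the correlation by a factor of only about 1.22, whereas effort (r/n) can move it by orders of magnitude.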
Heterozygosity (h) and effort (r/n) differ not only between allozymes and microsatellites, but also with regard to single-nucleotide polymorphisms (SNPs). SNPs are potentially attractive for correlating heterozygosity with fitness because their average heterozygosity is quite low, while effort is high relative to allozymes and microsatellites. SNPs are usually biallelic, and the allele frequencies typically are skewed so that one allele is rare ("minor") and one ("major") common (Glaubitz et al. 2003; Marth et al. 2001). Minor allele frequencies often range from 0.01 to 0.20; and thus, expected heterozygosity under Hardy-Weinberg equilibrium conditions ranges from about 2% to 32%. This is significantly lower than expected heterozygosity at a typical microsatellite locus but roughly equivalent to allozyme loci. However, for the purpose of heterozygosity-fitness association studies, the main advantage to SNPs is the number of loci that can be genotyped.
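The heterozygosity range quoted for SNPs follows directly from the Hardy-Weinberg expectation 2pq for a biallelic locus. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
def expected_heterozygosity_biallelic(minor_allele_frequency):
    """Hardy-Weinberg expected heterozygosity (2pq) at a biallelic locus."""
    q = minor_allele_frequency
    if not 0.0 <= q <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    p = 1.0 - q  # major allele frequency
    return 2.0 * p * q


# Minor allele frequencies at the two ends of the range given in the text.
for q in (0.01, 0.20):
    print(q, round(expected_heterozygosity_biallelic(q), 4))
```

A minor allele frequency of 0.01 gives 2(0.99)(0.01) = 0.0198, or about 2%, and a frequency of 0.20 gives 2(0.80)(0.20) = 0.32, or 32%.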
SNPs now can be assayed at hundreds or even thousands of loci in model organisms (Kwok 2001). This could, in principle, make SNPs very attractive, because the correlation between H and h is dictated primarily by the proportion of loci sampled. Unfortunately for empiricists, Figure 2 shows that the correlation between h and H is weak when the r/n ratio falls below 0.1 and becomes even more tenuous when the r/n ratio drops below 0.01. This means that for a genome with approximately 30,000 genes (as in humans), a herculean survey of 3,000 markers will produce a modest correlation of less than 0.40. More realistic surveys of a few dozen markers in organisms with similar genome sizes result in correlation coefficients < 0.05 (Figure 2).
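For planning purposes, the relationship can be turned around to ask how many markers a target correlation would require. The sketch below assumes the approximation that the correlation is about sqrt(r/n), a form inferred from the worked values in the text for fully heterozygous markers rather than taken from the published equation. Because the correlation is slightly higher at lower heterozygosity, the counts are rough and somewhat conservative. The function name and target values are hypothetical.

```python
import math


def markers_needed(target_rho, n_loci):
    """Approximate number of markers r such that sqrt(r / n_loci) >= target_rho.

    Assumes rho is about sqrt(r/n); this is an illustrative approximation only.
    """
    if not 0.0 < target_rho <= 1.0:
        raise ValueError("target correlation must lie in (0, 1]")
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round(target_rho ** 2 * n_loci, 9))


n = 30_000  # approximate gene count used in the text
for target in (0.1, 0.3, 0.5):
    print(target, markers_needed(target, n))
```

Under this approximation, even a correlation of 0.5 would require on the order of a quarter of all loci to be genotyped.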
Given the weak correlations between h and true genome-wide H, how then can we explain HFCs generated from molecular markers? Widespread reporting of genomic HFCs is due in part to publication bias (see Coltman and Slate 2003). In truth, most researchers have reported "hfCs," correlations between marker-estimated heterozygosity (h) and some correlate of fitness (f) (Figure 3). In terms of the three different hypotheses considered by Hansson and Westerberg (2002), the "direct effect" and the "local effect" rely on correlations between h and the statistic f or the parameter F (overall fitness), whereas the "general effect" relies on correlations between H and f or F. In terms of Figure 3, the null hypothesis for a direct or local effect would be that $\rho_2$ or $\rho_5$ = 0, whereas the null for the general effect hypothesis would be that $\rho_4$ or $\rho_6$ = 0.
Empirical correlations may be reasonable under the "direct effect" hypothesis for protein-coding or SNP loci, but they are much less tenable for microsatellites or other neutral markers. The "local effect" hypothesis, however, is plausible for any marker system, but its advocates face the burden of ruling out the unpalatable possibility of spurious correlation. (Avoiding spurious correlations is not impossible, but this requires a rigorous experimental design incorporating many markers of various types and, often, an accurate pedigree; see Hansson et al. (2004) for an exceptional example of a "local effect".) The "general effect" hypothesis is supported only when "hfCs" accurately represent "HFCs".
Given the poor correlation between h and H, the sample correlation ($r_2$) between h and f must be very strong to detect a significant population-level correlation ($\rho_4$) between H and F. If we assume, for simplicity's sake, that the statistic f accurately represents the parameter F, and yet we account for the poor correlation between h and H, then the sample correlation between H and f can be estimated as $r_6 \approx r_2\,\rho_1$ (see Appendix). The situation deteriorates rapidly as we further extrapolate to sample correlations between genome-wide heterozygosity (H) and overall fitness (F). We are left to conclude that most published "HFCs" are in truth "hfCs" and that the discrepancy between the two is largely due to the difference between marker-based heterozygosity and genome-wide heterozygosity.
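To give a sense of scale, the sketch below applies this attenuation to a hypothetical marker-based result. It assumes the relation is read as the product of the marker-fitness sample correlation and the expected correlation between H and h. The input of 0.30 is invented for illustration, and 0.02 is the value the text quotes for 20 fully heterozygous markers in a genome of 50,000 genes.

```python
def genome_wide_correlation(r_marker_fitness, rho_H_h):
    """Approximate H-f sample correlation as the product of the h-f sample
    correlation and the expected H-h correlation (assumed reading of the text)."""
    return r_marker_fitness * rho_H_h


r_hf = 0.30    # hypothetical sample correlation between h and f
rho_Hh = 0.02  # expected H-h correlation quoted in the text for r = 20, n = 50,000
r_Hf = genome_wide_correlation(r_hf, rho_Hh)
print(r_Hf)
```

Under this reading, a seemingly respectable marker-based correlation of 0.30 would correspond to a genome-wide correlation well below 0.01.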
In summary, our ability to detect significant heterozygosity-fitness correlations is constrained by our ability to estimate genome-wide heterozygosity. This is unlikely to change until we can develop high-density genetic maps that reflect recombination rates and subsequently select markers based on their genomic distribution (e.g., haplotype blocks; see Wall and Pritchard [2003]). For those who work on nonmodel organisms, the prospects for estimating individual genomic heterozygosity with a few randomly distributed molecular markers remain bleak.
| Appendix |
|---|
As in Figure 3, to test for a significant correlation between genome-wide heterozygosity and a correlate of fitness ($H_0$: $\rho_6 = 0$), one needs to compute the sample correlation ($r_6$) between H and f:
$$r_6 = \frac{S_{Hf}}{\sqrt{S_{HH}\,S_{ff}}}$$ | (4) |
One can estimate $S_{HH}$ and $S_{Hf}$ by utilizing the expected correlation between H and h ($\rho_1$ given by equation [1]) and the asymptotic result that the least-squares regression line approaches the identity line (i.e., h = H). Therefore, the least-squares regression slope, $b = S_{Hh}/S_{hh}$, converges to unity as the number of markers assayed approaches the number of loci in the genome. It now follows that
$$S_{HH} \approx \frac{S_{hh}}{\rho_1^{2}} \quad\text{and}\quad S_{Hf} \approx S_{hf}$$
are both reasonable and asymptotically true. Substitution of these estimates into equation (4) gives
$$r_6 \approx r_2\,\rho_1$$
Thus, $r_2$ must be multiplied by $\rho_1$ in order to reflect genome-wide heterozygosity.
| Acknowledgments |
|---|
We thank J. Avise, D. Bos, J. Busch, J. Glaubitz, J. Rudnick, D. Triant, S. Turner, and R. Williams for their input. This research was supported in part by Purdue University and by a U.S. Department of Agriculture National Research Initiative grant (#2003-03616). This is publication #ARP17260 from the School of Agriculture at Purdue University.
| Footnotes |
|---|
Corresponding Editor: Brian Bowen
Received June 1, 2004
Accepted August 15, 2004
| References |
|---|
Chakraborty R, 1981. The distribution of the number of heterozygous loci in an individual in natural populations. Genetics 98:461–466.
Coltman DW and Slate J, 2003. Microsatellite measures of inbreeding: a meta-analysis. Evolution 57:971–983.
David P, 1998. Heterozygosity-fitness correlations: new perspectives on old problems. Heredity 80:531–537.
DeWoody JA and Avise JC, 2000. Microsatellite variation in marine, freshwater, and anadromous fishes compared with other animals. J Fish Biol 56:461–473.
Glaubitz JC, Rhodes OE, and DeWoody JA, 2003. Prospects for inferring pairwise relationships with single nucleotide polymorphisms. Mol Ecol 12:1039–1047.
Hansson B and Westerberg L, 2002. On the correlation between heterozygosity and fitness in natural populations. Mol Ecol 11:2467–2474.
Hansson B, Westerdahl H, Hasselquist D, Akesson M, and Bensch S, 2004. Does linkage disequilibrium generate heterozygosity-fitness correlations in great reed warblers? Evolution 58:870–879.
Kwok P-Y, 2001. Methods for genotyping single nucleotide polymorphisms. Annu Rev Genomics Hum Genet 2:235–258.
Marth G, Yeh R, Minton M, Donaldson R, Li Q, Duan SG, Davenport R, Miller RD, and Kwok P-Y, 2001. Single-nucleotide polymorphisms in the public domain: how useful are they? Nat Genet 27:371–372.
Mitton JB and Pierce BA, 1980. The distribution of individual heterozygosity in natural populations. Genetics 95:1043–1054.
Mitton JB, 1997. Selection in natural populations. Oxford: Oxford University Press.
Wall JD and Pritchard JK, 2003. Assessing the performance of the haplotype block model of linkage disequilibrium. Am J Hum Genet 73:502–515.