Journal of Heredity 2003:94(5)
© 2003 The American Genetic Association 94:421-424
Brief Communication
Uniformly Minimum Variance Unbiased Estimation of Gene Diversity
From the Department of Epidemiology, Box 189, University of Texas, M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, TX 77030.
Address correspondence to Dr. Shete at the address above, or e-mail: sshete{at}mdanderson.org.
| Abstract |
|---|
Gene diversity is an important measure of genetic variability in inbred populations. The survival of species in changing environments depends on, among other factors, the genetic variability of the population. In this communication, I have derived the uniformly minimum variance unbiased estimator of gene diversity. The proposed estimator of gene diversity does not assume that the inbreeding coefficient is known. I have also provided the approximate variance of this estimator according to Fisher's method. In addition, I have developed a numerical resampling-based method for obtaining variances and confidence intervals based on the maximum likelihood estimator and the uniformly minimum variance unbiased estimator. Efficiency in estimation of the gene diversity based on these two estimators is discussed. In accordance with the simulation results, I found that the uniformly minimum variance estimator developed in this report is more accurate for estimation of gene diversity than the maximum likelihood estimator.
The genetic variability of a population is an important quantity and can be measured in different ways. The extent to which a population can adapt to a changing environment is determined by the population's genetic variation (Brown et al. 1989). One way to measure this variability is by studying polymorphisms of genetic markers. The degree of polymorphism in a population can be measured by determining the proportion of individuals heterozygous for a marker and can be averaged over several loci to obtain the heterozygosity of the population. Another measure of polymorphism is Nei's proportion of informative families (Nei 1979). The same measure was introduced as a polymorphism information content value (Botstein et al. 1980), which was defined for a codominant marker used in a linkage study of families that had a rare dominant disease and had one heterozygous parent. Guo and Elston (1999) have shown that this measure is relevant, regardless of the mode of disease inheritance.
These measures of variation are suitable for random-mating populations, but for inbred populations, such as some plant and animal populations, in which variation is due to the presence of different homozygotes, gene diversity (Nei 1973) is a more appropriate measure of variation (Weir 1996). The genetic diversity within a population is the variety of subpopulations that comprise it. The mean and variance of an unbiased estimator of gene diversity in random-mating populations are known (Nei 1987; Nei and Roychoudhury 1974). The two measures of variability, heterozygosity and gene diversity, are the same for random-mating populations but can be significantly different in inbred populations. For example, in inbred populations there may be very few heterozygotes but several different homozygotes. Efficient methods of estimating gene diversity in populations are of vital importance to evolutionary biologists. Sample gene diversity is a maximum likelihood estimator (MLE), but it is biased. In this communication I derive the uniformly minimum variance unbiased estimator (UMVUE) of gene diversity. I also provide the approximate within-population variance of this estimator, based on Fisher's delta method. Because the delta method ignores terms that are significant in the computation of the variance of the estimators and depends on knowledge of the inbreeding coefficient, I have developed a parametric bootstrap sampling-based method to obtain the variances and confidence intervals for the gene diversity estimators. The bootstrap estimators of the variance and confidence interval do not require knowledge of the inbreeding coefficient. From the simulation results, I conclude that the UMVUE is a more accurate estimator of gene diversity than the MLE and that the MLE usually underestimates gene diversity. I applied this method to the three esterase loci of Barley Composite Cross V (Weir et al. 1972).
| Materials and Methods |
|---|
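Consider an inbred population with m loci. Let $p_{lu}$ be the frequency of the uth allele at the lth locus; then the gene diversity for the lth locus is defined as

$$D_l = 1 - \sum_u p_{lu}^2, \tag{1}$$

and the average gene diversity over the m loci is

$$D = \frac{1}{m}\sum_{l=1}^{m} D_l. \tag{2}$$

Let $P_{uv}$ be the frequency of the genotype $A_uA_v$ at the lth locus. Then $P_{uu} = p_{lu}^2 + p_{lu}(1 - p_{lu})f$, and for $u \neq v$, $P_{uv} = 2p_{lu}p_{lv}(1 - f)$, where f is the within-population inbreeding coefficient, which measures the extent to which the genotypic frequencies depart from the expected frequencies under Hardy-Weinberg equilibrium. When f = 0 the genotypes are in Hardy-Weinberg equilibrium, and for completely selfing species f = 1.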
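As a small illustration of these definitions, the following Python sketch computes the single-locus gene diversity, $1 - \sum_u p_u^2$, and the genotype frequencies implied by a given inbreeding coefficient. The function names and the example allele frequencies are hypothetical choices made for illustration; they are not taken from the author's program.

```python
from itertools import combinations_with_replacement


def gene_diversity(p):
    """Single-locus gene diversity: D_l = 1 - sum_u p_u^2."""
    return 1.0 - sum(x * x for x in p)


def genotype_frequencies(p, f):
    """Genotype frequencies P_uv (keyed by u <= v) for inbreeding coefficient f.

    Homozygotes:   P_uu = p_u^2 + p_u (1 - p_u) f
    Heterozygotes: P_uv = 2 p_u p_v (1 - f), for u != v
    """
    freqs = {}
    for u, v in combinations_with_replacement(range(len(p)), 2):
        if u == v:
            freqs[(u, v)] = p[u] ** 2 + p[u] * (1.0 - p[u]) * f
        else:
            freqs[(u, v)] = 2.0 * p[u] * p[v] * (1.0 - f)
    return freqs


# Hypothetical example: three alleles, moderate inbreeding.
p = [0.5, 0.3, 0.2]
f = 0.2
P = genotype_frequencies(p, f)
D = gene_diversity(p)
heterozygosity = sum(x for (u, v), x in P.items() if u != v)

print(f"gene diversity D_l = {D:.4f}")
print(f"heterozygosity     = {heterozygosity:.4f}")
print(f"(1 - f) * D_l      = {(1 - f) * D:.4f}")
```

Under these formulas the expected heterozygosity equals $(1 - f)D_l$, so it coincides with gene diversity only when f = 0 and vanishes under complete selfing, which is why the two measures diverge in inbred populations.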
Maximum Likelihood Estimation
Consider a sample of N diploid individuals (2N alleles). Let $N_{uv}$ be the number of individuals with the genotype $A_uA_v$ at the lth locus in this sample. We can then model the observed counts $N_{uv}$ as a multinomial distribution and obtain the MLE of $p_{lu}$: $\hat p_{lu} = (2N_{uu} + \sum_{v \neq u} N_{uv})/2N$. Then, the MLE of $D_l$ is given by

$$\hat D_l = 1 - \sum_u \hat p_{lu}^2. \tag{3}$$

The expected value of this estimator is $E(\hat D_l) = [1 - (1 + f)/2N]D_l$. Hence, the MLE is a biased estimator (Weir 1996). The MLE of D in (2) is obtained by replacing $D_l$ with its MLE $\hat D_l$, which is also biased:

$$\hat D = \frac{1}{m}\sum_{l=1}^{m} \hat D_l. \tag{4}$$
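The maximum likelihood calculation for a single locus can be sketched in Python as follows. The dictionary layout for the genotype counts, the function names, and the example counts are assumptions made for illustration only. The last helper evaluates the expected value $[1 - (1 + f)/2N]D_l$ to show the size of the downward bias.

```python
def allele_frequency_mle(counts, n_alleles):
    """MLE of allele frequencies from genotype counts.

    counts maps (u, v) with u <= v to N_uv, the number of individuals
    carrying genotype A_u A_v. Each individual contributes two alleles,
    so p_hat_u = (2 N_uu + sum_{v != u} N_uv) / (2 N).
    """
    n_individuals = sum(counts.values())
    copies = [0] * n_alleles
    for (u, v), n in counts.items():
        copies[u] += n  # a homozygote (u == v) is counted twice, as required
        copies[v] += n
    return [c / (2.0 * n_individuals) for c in copies]


def diversity_mle(counts, n_alleles):
    """MLE of single-locus gene diversity: 1 - sum_u p_hat_u^2."""
    p_hat = allele_frequency_mle(counts, n_alleles)
    return 1.0 - sum(x * x for x in p_hat)


def expected_mle(d_true, f, n_individuals):
    """Expected value of the MLE: [1 - (1 + f) / (2N)] * D_l."""
    return (1.0 - (1.0 + f) / (2.0 * n_individuals)) * d_true


# Hypothetical sample of N = 100 individuals at a two-allele locus.
counts = {(0, 0): 40, (0, 1): 20, (1, 1): 40}
d_hat = diversity_mle(counts, n_alleles=2)
print("allele frequencies:", allele_frequency_mle(counts, 2))
print("MLE of gene diversity:", d_hat)
print("E(MLE) if D_l = 0.5, f = 0.2, N = 100:", expected_mle(0.5, 0.2, 100))
```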
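Weir (1989) used Fisher's delta method to obtain the variances of these estimators. For a single locus,

$$\mathrm{Var}(\hat D_l) \approx \frac{2(1 + f)}{N}\left[\sum_u p_{lu}^3 - \Big(\sum_u p_{lu}^2\Big)^2\right]. \tag{5}$$

The corresponding variance of $\hat D$, equation (6), is not legible in this copy; it involves $\Delta_{lu,l'v}$, the association between alleles at different loci. Equation (5) is identical to the formula for variance given in Nei (1987) when the inbreeding coefficient is zero.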
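To make the behavior of the single-locus delta-method variance concrete, here is a short Python sketch. It assumes the standard first-order form $2(1 + f)/N \, [\sum_u p_u^3 - (\sum_u p_u^2)^2]$, which follows from applying the delta method to $1 - \sum_u \hat p_u^2$ with $\mathrm{Var}(\hat p_u) = p_u(1 - p_u)(1 + f)/2N$. The function name and example values are hypothetical.

```python
def delta_variance_mle(p, f, n_individuals):
    """First-order (delta-method) variance of the single-locus MLE.

    Assumed form: 2 (1 + f) / N * [sum_u p_u^3 - (sum_u p_u^2)^2].
    """
    sum_sq = sum(x ** 2 for x in p)
    sum_cube = sum(x ** 3 for x in p)
    return 2.0 * (1.0 + f) / n_individuals * (sum_cube - sum_sq ** 2)


# Unequal allele frequencies give a positive approximate variance.
unequal = delta_variance_mle([0.7, 0.3], f=0.0, n_individuals=100)

# Equifrequent alleles make the bracketed term vanish, so the first-order
# approximation reports zero variance regardless of f or N.
equal = delta_variance_mle([0.25] * 4, f=0.2, n_individuals=100)

print("unequal frequencies:", unequal)
print("equifrequent alleles:", equal)
```

The second case illustrates a limitation of the first-order approximation: with equifrequent alleles only higher-order terms contribute to the variance, and the delta method drops them.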
Uniformly Minimum Variance Unbiased Estimation
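Here I propose a new estimator of gene diversity and show that it is the UMVUE of gene diversity. Consider

$$E\Big(1 - \sum_u \frac{N_{uu}}{N}\Big) = 1 - \sum_u P_{uu} = (1 - f)D_l.$$

From $E(\hat D_l)$ and the equation above, we can write

$$E\left[\hat D_l - \frac{1}{2N}\Big(1 - \sum_u \frac{N_{uu}}{N}\Big)\right] = \frac{N - 1}{N}D_l.$$

Hence, $\frac{N}{N - 1}[\hat D_l - (1 - \sum_u N_{uu}/N)/2N]$ is the unbiased estimator of gene diversity $D_l$ at the locus l. This can be rewritten as

$$\tilde D_l = \frac{N}{N - 1}\left[1 - \sum_u \hat p_{lu}^2 - \frac{1}{2N}\Big(1 - \sum_u \frac{N_{uu}}{N}\Big)\right]. \tag{7}$$

Because $\tilde D_l$ is a function of a complete, sufficient statistic and is unbiased, it follows from the Rao-Blackwell-Lehmann-Scheffé theorem that $\tilde D_l$ is the unique UMVUE of $D_l$ (see Casella and Berger 1990, theorem 7.3.5). The estimator in (7) is N/(N - 1) times the denominator of the estimator of the inbreeding coefficient given in Weir (1996, equation 2.28). The UMVUE of D in (2) is obtained by replacing $D_l$ with its UMVUE $\tilde D_l$:

$$\tilde D = \frac{1}{m}\sum_{l=1}^{m} \tilde D_l. \tag{8}$$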
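The unbiased estimator can be sketched in Python as the MLE minus the observed heterozygosity divided by 2N, rescaled by N/(N - 1). The names below are hypothetical. As a sanity check rather than a proof, the second function computes the exact expectation for a two-allele locus by enumerating every possible set of genotype counts, which should reproduce the true gene diversity without knowing f inside the estimator.

```python
from math import comb


def diversity_umvue(counts, n_alleles):
    """Unbiased single-locus estimator:
    N / (N - 1) * [D_hat - (1 - sum_u N_uu / N) / (2N)].

    counts maps (u, v) with u <= v to the genotype count N_uv.
    """
    n = sum(counts.values())
    copies = [0] * n_alleles
    homozygotes = 0
    for (u, v), c in counts.items():
        copies[u] += c
        copies[v] += c
        if u == v:
            homozygotes += c
    d_mle = 1.0 - sum((x / (2.0 * n)) ** 2 for x in copies)
    observed_het = 1.0 - homozygotes / n
    return n / (n - 1.0) * (d_mle - observed_het / (2.0 * n))


def exact_expectation_two_alleles(p0, f, n):
    """Exact expectation of the estimator for two alleles (small n only)."""
    p1 = 1.0 - p0
    P00 = p0 * p0 + p0 * p1 * f
    P01 = 2.0 * p0 * p1 * (1.0 - f)
    P11 = p1 * p1 + p0 * p1 * f
    total = 0.0
    for a in range(n + 1):
        for b in range(n - a + 1):
            c = n - a - b
            # Multinomial probability of observing counts (a, b, c).
            prob = comb(n, a) * comb(n - a, b) * P00 ** a * P01 ** b * P11 ** c
            total += prob * diversity_umvue({(0, 0): a, (0, 1): b, (1, 1): c}, 2)
    return total


# Hypothetical sample of N = 100 individuals.
counts = {(0, 0): 40, (0, 1): 20, (1, 1): 40}
d_tilde = diversity_umvue(counts, 2)

expectation = exact_expectation_two_alleles(0.3, f=0.4, n=6)
true_diversity = 1.0 - 0.3 ** 2 - 0.7 ** 2
print("estimate:", d_tilde)
print("exact expectation vs. true value:", expectation, true_diversity)
```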
The approximate variance of the UMVUE is given in the Appendix. Approximate variances of these estimators (MLE, as well as UMVUE) based on Fisher's method ignore terms of order $N^{-2}$, which are significant. For example, when a locus has equifrequent alleles, the estimated variances are zero. In addition, to calculate these variances we need to know the inbreeding coefficient, which is in general not known and has to be estimated from data. If this is done, the variance of gene diversity is substantially inflated, because the variance of the estimator of $f$ is known to be very large. So, I propose using a numerical resampling-based method, parametric bootstrapping (Efron and Tibshirani 1993), to obtain estimates of variances of both MLE and UMVUE. I also provide bootstrap confidence intervals. The variance and confidence intervals obtained with the bootstrap method proposed below do not assume knowledge of the inbreeding coefficient. The bootstrap variances and confidence intervals were obtained by the following steps:
1. Based on the observed counts $N_{uv}$, I first obtained estimates $\hat P_{uv}$ of $P_{uv}$. I then calculated $\hat D_l$ and $\tilde D_l$ as given in (3) and (7), respectively.
2. Next, I took a sample of size N from a multinomial distribution with frequencies $\hat P_{uv}$; calculated bootstrap estimates of gene diversity, based on the MLE and UMVUE estimation procedures; and denoted the bootstrap estimates based on this sample $\hat D_{bs}$ and $\tilde D_{bs}$, respectively.
3. I then repeated step 2 for B bootstrap samples.
4. I calculated the bootstrap estimates of variance of $\hat D_l$ and $\tilde D_l$ as the sample variances of the B bootstrap estimates $\hat D_{bs}$ and $\tilde D_{bs}$, respectively.
5. Then, I let $\hat D_{[i]}$ and $\tilde D_{[i]}$ be the ith ordered bootstrap estimates. Then the $100(1 - \alpha)\%$ confidence interval is $(\hat D_{[B\alpha/2]}, \hat D_{[B(1 - \alpha/2)]})$, based on MLE, and $(\tilde D_{[B\alpha/2]}, \tilde D_{[B(1 - \alpha/2)]})$, based on UMVUE.
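The bootstrap steps above can be sketched for a single locus as follows. This is a minimal illustration, not the author's program: the function names, the dictionary layout for genotype counts, the fixed seed, and the use of a B - 1 divisor in the variance are all assumptions.

```python
import random


def estimators(counts, n_alleles):
    """Return (MLE, unbiased estimate) of gene diversity from genotype counts."""
    n = sum(counts.values())
    copies = [0] * n_alleles
    homozygotes = 0
    for (u, v), c in counts.items():
        copies[u] += c
        copies[v] += c
        if u == v:
            homozygotes += c
    mle = 1.0 - sum((x / (2.0 * n)) ** 2 for x in copies)
    umvue = n / (n - 1.0) * (mle - (1.0 - homozygotes / n) / (2.0 * n))
    return mle, umvue


def parametric_bootstrap(counts, n_alleles, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap variance and percentile interval for both estimators."""
    rng = random.Random(seed)
    n = sum(counts.values())
    genotypes = list(counts)
    weights = [counts[g] / n for g in genotypes]  # step 1: estimated P_uv

    boot = {"mle": [], "umvue": []}
    for _ in range(n_boot):  # steps 2 and 3: resample and re-estimate
        sample = dict.fromkeys(genotypes, 0)
        for g in rng.choices(genotypes, weights=weights, k=n):
            sample[g] += 1
        mle, umvue = estimators(sample, n_alleles)
        boot["mle"].append(mle)
        boot["umvue"].append(umvue)

    # Step 5 uses the (B * alpha / 2)th and (B * (1 - alpha / 2))th ordered
    # values; convert those 1-based ranks to 0-based list indices.
    lower = max(round(n_boot * alpha / 2.0), 1) - 1
    upper = min(round(n_boot * (1.0 - alpha / 2.0)), n_boot) - 1

    summary = {}
    for name, values in boot.items():
        values.sort()
        mean = sum(values) / n_boot
        # Step 4: sample variance (B - 1 divisor is an assumption here).
        variance = sum((x - mean) ** 2 for x in values) / (n_boot - 1)
        summary[name] = {"variance": variance, "ci": (values[lower], values[upper])}
    return summary


# Hypothetical genotype counts for N = 100 individuals.
counts = {(0, 0): 40, (0, 1): 20, (1, 1): 40}
result = parametric_bootstrap(counts, n_alleles=2, n_boot=500, seed=1)
for name, stats in result.items():
    print(name, stats)
```

Because the resampling draws genotypes rather than alleles, the homozygote excess in the data is carried into every bootstrap sample, which is why no explicit value of the inbreeding coefficient is needed.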
| Simulation Results and Discussion |
|---|
I performed simulations to study the performance of the UMVUE developed in this report and the MLE. I assumed an inbreeding coefficient (f) of 0.2 and considered two to seven equifrequent alleles. I used a sample size N of 100 and considered 1,000 bootstrap samples. The quantities reported in Table 1 are based on averages of 500 replicate samples. The results presented in Table 1 suggest that the UMVUE proposed here was more accurate than the MLE, which is a biased estimator. Overall, the MLE underestimated the gene diversity. The estimates of standard deviations were the same up to the fourth decimal place for both the UMVUE and MLE. Also, the 95% bootstrap confidence intervals based on $\hat D_l$ did not contain the true diversities, whereas the 95% bootstrap confidence intervals based on $\tilde D_l$ did. Alternatively, one could use the symmetric confidence intervals based on asymptotic normal theory. I also performed the simulation for the extreme case of complete selfing (f = 1) and found that in this case also the MLE underestimated the true value and that the bootstrap intervals based on MLE did not contain the true gene diversity value, whereas bootstrap intervals based on UMVUE did (data not shown).
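A reduced version of this simulation design can be sketched in Python to compare the average of the two estimators with the true gene diversity. It uses the same settings (f = 0.2, N = 100, two to seven equifrequent alleles, 500 replicates) but omits the inner bootstrap, so it illustrates only the bias comparison. The function name and seed are hypothetical, and the output is not a reproduction of Table 1.

```python
import random


def simulate_bias(n_alleles, f=0.2, n=100, replicates=500, seed=0):
    """Average both estimators over replicate samples with equifrequent alleles.

    Returns (true diversity, mean MLE, mean unbiased estimate).
    """
    rng = random.Random(seed)
    p = 1.0 / n_alleles
    genotypes, weights = [], []
    for u in range(n_alleles):
        for v in range(u, n_alleles):
            genotypes.append((u, v))
            if u == v:
                weights.append(p * p + p * (1.0 - p) * f)  # homozygote
            else:
                weights.append(2.0 * p * p * (1.0 - f))    # heterozygote

    sum_mle = sum_umvue = 0.0
    for _ in range(replicates):
        copies = [0] * n_alleles
        homozygotes = 0
        # Draw N genotypes from the multinomial model.
        for u, v in rng.choices(genotypes, weights=weights, k=n):
            copies[u] += 1
            copies[v] += 1
            homozygotes += (u == v)
        mle = 1.0 - sum((c / (2.0 * n)) ** 2 for c in copies)
        umvue = n / (n - 1.0) * (mle - (1.0 - homozygotes / n) / (2.0 * n))
        sum_mle += mle
        sum_umvue += umvue

    true_d = 1.0 - 1.0 / n_alleles
    return true_d, sum_mle / replicates, sum_umvue / replicates


for k in range(2, 8):
    true_d, mean_mle, mean_umvue = simulate_bias(k)
    print(f"alleles={k}  true={true_d:.4f}  MLE={mean_mle:.4f}  UMVUE={mean_umvue:.4f}")
```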
I applied the method developed here to three esterase loci, A, B, and C, of Barley Composite Cross V (Weir et al. 1972). The data were reported in Weir (1996, Table 4.2). Weir et al. (1972) reported $\hat D_A$ = 0.4984, $\hat D_B$ = 0.1710, and $\hat D_C$ = 0.4182. From the UMVUE developed in this report, I obtained $\tilde D_A$ = 0.4988, $\tilde D_B$ = 0.1711, and $\tilde D_C$ = 0.4185. The estimates of average gene diversity over these three loci were $\hat D$ = 0.3625 and $\tilde D$ = 0.3628. The estimate of standard deviation based on the bootstrap method was 0.0053 for both $\hat D$ and $\tilde D$. These estimates were very close to the standard deviation (0.0055) obtained with Fisher's delta method. The bootstrap confidence intervals were (0.353, 0.371) based on MLE, and (0.354, 0.371) based on UMVUE.

In conclusion, different populations have different genetic diversities, which have a significant effect on the risk of extinction of species. Endangered species typically have low genetic diversity and hence lower fitness for survival. Because genetic diversity is important, it is important to have accurate methods of estimating it. In this communication I have proposed the UMVUE of gene diversity, which is more accurate than the MLE and hence should be the method used in estimating gene diversity. The difference between the MLE and the UMVUE is not necessarily great, and, in fact, with large sample sizes, will be very small, but the UMVUE is quite clearly preferred. I also have provided an explicit formula for calculating the variance of this estimator based on Fisher's delta method. In addition, I have developed a bootstrap method for computing the variances of UMVUE and MLE and the confidence intervals for gene diversity. The proposed estimator and the bootstrap method for computing the variance and confidence interval do not require knowledge of the inbreeding coefficient. A computer program that performs these computations is available freely from the author at http://www.epigenetic.org/Linkage/GENEDIVERSITY.
| Appendix |
|---|
Computation of the Variance of the Gene Diversity Estimator
I obtained the variance of the gene diversity estimator ($\tilde D_l$) by using Fisher's delta method, which is given below.
|
|
v
uNuv)/2]2 - Nuu/2. Using the delta method,
|
|
|
|
l.
The variance of UMVUE of average gene diversity is obtained similarly by the delta method. From (8) and (9), we can write
|
|
lu,l'v - N2[pl'vDuuv + pluDuvv] +
[
- PuuPvv], where Duuv and Duvv are trigenic disequilibria, and
is the proportion of individuals who are homozygous for allele u at locus l and for allele v at the locus l' (Weir 1996).
| Acknowledgments |
|---|
I am grateful to Prof. Bruce S. Weir for helpful suggestions and comments on an earlier version of the article. I also thank Dr. Maureen Goode for comments that led to a better presentation of the material in this communication.
| Footnotes |
|---|
Corresponding Editor: Masatoshi Nei
Received July 23, 2002
Accepted May 29, 2003
| References |
|---|
Botstein D, White RL, Skolnick MH, Davies RW, 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet. 32:314-331.
Brown AHD, Clegg MT, Kahler AL, Weir BS, 1989. Plant population genetics, breeding and genetic resources. Sunderland, MA: Sinauer.
Casella G, Berger RL, 1990. Statistical inference. Belmont, CA: Duxbury.
Efron B, Tibshirani RJ, 1993. An introduction to the bootstrap. New York: Chapman and Hall.
Guo X, Elston RC, 1999. Linkage information content of polymorphic genetic markers. Hum Hered. 49:112-118.
Lehmann EL, Casella G, 1998. Theory of point estimation. New York: Springer-Verlag.
Nei M, 1973. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci. 70:3321-3323.
Nei M, 1979. Proportion of informative families for genetic counselling with linked marker genes. Jinrui Idengaku Zasshi. 24:131-142.
Nei M, 1987. Molecular evolutionary genetics. New York: Columbia University Press.
Nei M, Roychoudhury AK, 1974. Sampling variances of heterozygosity and genetic distance. Genetics. 76:379-390.
Weir BS, 1989. Sampling properties of gene diversity. In: Plant population genetics, breeding and genetic resources (Brown AHD, Clegg MT, Kahler AL, and Weir BS, eds). Sunderland, MA: Sinauer; 23-42.
Weir BS, 1996. Genetic data analysis II. Sunderland, MA: Sinauer.
Weir BS, Allard RW, Kahler AL, 1972. Analysis of complex allozyme polymorphisms in a barley population. Genetics. 72:505-523.