Journal of Heredity Advance Access originally published online on September 20, 2006
Journal of Heredity 2006 97(5):483-492; doi:10.1093/jhered/esl030
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A Comparison of Rarefaction and Bayesian Methods for Predicting the Allelic Richness of Future Samples on the Basis of Currently Available Samples
From the Laboratoire Génome, Populations, Interactions, Adaptation, UM2-IFREMER-CNRS UMR 5171, Université de Montpellier II, 34095 Montpellier, Cedex 5, France (Belkhir and Bonhomme); and Rothamsted Research, Harpenden, Hertfordshire AL5 2JQ, UK (Dawson)
Address correspondence to K. Belkhir at the address above or e-mail: belkhir{at}univ-montp2.fr.
Rarefaction methods have been introduced into population genetics (from ecology) for predicting and comparing the allelic richness of future samples (or sometimes populations) on the basis of currently available samples, possibly of different sizes. Here, we focus our attention on one such problem: Predicting which population is most likely to yield the future sample having the highest allelic richness. (This problem can arise when we want to construct a core collection from a larger germplasm collection.) We use extensive simulations to compare the performance of the Monte Carlo rarefaction (repeated random subsampling) method with a simple Bayesian approach we have developedwhich is based on the Ewens sampling distribution. We found that neither this Bayesian method nor the (Monte Carlo) rarefaction method performed uniformly better than the other. We also examine briefly some of the other motivations offered for these methods and try to make sense of them from a Bayesian point of view.
Methods for predicting and comparing allelic richness of future samples (or sometimes populations) on the basis of currently available samples, possibly of different sizes, continue to attract attention (see, e.g., Leberg 2002; Kalinowski 2004). However, the parameter of interest and the motivation for such methods are not always clear. For example, Leberg (2002) often fails to make a clear distinction between the problem of estimating the allelic richness of a population and the problem of predicting the allelic richness of a future sample from the population. In fact, Leberg (2002) leaves further doubt as to what is the parameter of interest. Is it the allelic richness of a particular locus or the "average" allelic richness of some set of loci? Which set of loci? Here, we focus on the problem of making predictions about allelic richness of future samples (at a single locus), on the basis of currently available samples (allele count data from the same locus). More specifically, the problem considered here is that of predicting which population is most likely to yield the future sample having the highest allelic richness.
The allelic richness of the population at a particular locus is the total number of allele types present in the population at that locus. This quantity is often denoted by A. If we let K(n) denote the number of distinct allele types represented in a sample of n allele copies sampled "without replacement" from the population, then K(n) may be referred to as the "allelic richness of a sample" of size n. If N is the census size of a (diploid) population, then clearly K(2n) = A (the allelic richness of this population). Note that for a population of known composition, A is a population parameter, whereas K(n) is a random variable for 1 < n < 2N. (Furthermore, K(1) = 1, and K(n)
A for 1 < n < 2N.)
The construction of a so-called "core collection" from a large germplasm collection is a situation where allelic richness is a relevant measure of diversity (Schoen and Brown 1993; Bataillon et al. 1996), because we want to retain as many adaptively significant allele types as possible in the core collection, where they will be available for phenotypic screening and breeding programs. Some of the other motivations offered for making inferences about the allelic richness of populations or future samples will be discussed briefly at the end of this paper.
Suppose that we can retain N' diploid genotypes (typically in the form of seeds) per population in the core collection. In that case, the relevant measure of diversity is K(n'), where n' = 2N'. If enough tissue for genotyping can be collected nondestructively, then we could simply sample N' diploid individuals from each candidate population and retain the "winning" sample in the core collection. Otherwise, we will have to predict which population will yield the highest value of the allelic richness K(n') when we draw a new sample of size n', and usually this prediction will have to be made on the basis of a smaller sample (of size n < n' allele copies) from each candidate population. Clearly, the relevant question is not which population has the highest allelic richness (the greatest number of distinct allele types)? Here, the relevant question must be: which population is most likely to yield the sample of size n' having the highest allelic richness? In general, the answers to these questions need not be the same.
If the samples from the candidate populations are all the same size, n, then it is reasonable to predict that the population that yielded the most allele types in a sample of n allele copies will also yield the most allele types in a future sample of n' allele copies. Furthermore, if the samples differ in size, and the smallest sample contains most allele types, then it is reasonable to predict that the population from which this smallest sample came will yield the most diverse sample of size n'. The cases which are more difficult to decide are those where the largest sample contains the most allele types. Even in the more straightforward situations, it is desirable to have some measure of our confidence in the decision which has been reached. Can this question be answered using the rarefaction method? Before addressing this question, we briefly review the history of rarefaction methods.
Rarefaction methods originated in community ecology. In order to facilitate comparison of species richness, Saunders (1968) proposed a deterministic scheme for discarding individuals (according to the frequencies of their types) from larger samples until all samples have been reduced to the size of the smallest. Saunders referred to his scheme as the "rarefaction" method. Hurlbert (1971) gave a formula for the expected species richness in a sample of any chosen size from a finite collection but did not propose any method for inferring the species richness of the collection from a sample. Subsequent authors (Heck et al. 1975; James and Rathbun 1981) used Hurlbert's formula to calculate the expected species richness of a smaller sample of a given size, obtained by sampling without replacement from a larger sample. If we have samples of differing sizes, we can calculate the expectation of the species richness when each sample is reduced to the size of the smallest sample by sampling without replacement. (Higher moments of the species richness can be computed similarly.) These expected species richnesses can then be compared with each other and with the observed species richness of a smallest sample. This approach has come to be known as the "rarefaction method" (whereas the original method of Saunders has become obsolete). El Mousadik and Petit (1996) introduced this method into population genetics and used it to compare the allelic richness of samples of differing sizes. (We examine their motivation for these comparisons later.) It is straightforward to simulate this distribution of subsamples of size n' simply by repeating the process of drawing a sample of size n' without replacement, from the same original sample of size n, a large number of times. From this simulated distribution, we can estimate the expected allelic richness of the subsample. In the same way, we can estimate the variance, or any other moment, of the subsample's allelic richness. Leberg (2002) referred to this method as "repeated random subsampling" (or "multiple random reductions"). Here we refer to it as the "Monte Carlo rarefaction method." Note that the Monte Carlo rarefaction method differs from bootstrap resampling, where the resampling is "with replacement."
It is obvious from its construction that the rarefaction method cannot provide an estimate of the allelic richness of the population. Repeatedly drawing samples of size n', from the same original sample of size n, could only be expected to provide a prediction of the allelic richness of a future sample which is also of the size n'. A serious objection to the rarefaction method (Monte Carlo or analytic) is that it fails to take account of our uncertainty about the composition of the population. (The allele composition of the initial sample can severely restrict the range of the sampling distribution of the allelic richness.)
The usual way of applying the rarefaction method (see, e.g., El Mousadik and Petit 1996; Leberg 2002) is to compare the expected allelic richness of samples of standard size (the size of the smallest sample) from each of the larger samples with each other and with the observed allelic richness of the smallest sample. The sampling variance of the allelic richness for the larger samples may be used to quantify uncertainty from a frequentist point of view. Here we have applied the rarefaction method in an alternative way, which facilitates direct comparison with the Bayesian method. We compute the probability that the allelic richness of a sample of standard size (the size of the smallest sample) from the larger sample is greater than (less than, or equal to) the observed allelic richness of the smallest sample. These probabilities can be used as the basis for a simple decision rule (as described below).
Here we present a simple Bayesian approach to the problem of predicting (from currently available samples) which population will yield the future sample having the highest allelic richness, when we collect samples of some fixed common size. This Bayesian approach is based on the Ewens sampling distribution (Ewens 1972, 1990). We use simulations to compare the performance of the Bayesian method and the Monte Carlo rarefaction (repeated random subsampling) method (Leberg 2002).
Within the Bayesian framework, we can compute the posterior (predictive) distribution of the random variable K(n'). We can also compute the posterior probability that the allelic richness K(n') of a future sample (of a standard size n') from one population is greater than (less than, or equal to) the allelic richness of a future sample from another population, conditional on the currently available sample from each. From these posterior (predictive) probabilities, we can identify directly which population is most likely to yield the future sample having the highest allelic richness. Furthermore, in contrast to the rarefaction method, the Bayesian framework allows us to compute these probabilities for any standard size n' we choose, greater than or less than the size of the smallest observed sample.
We need a model for the composition of a population. We have chosen to use the "Ewens sampling distribution" (Ewens 1972), partly for computational reasons. The Ewens sampling distribution is perhaps the simplest model for the composition of a hypervariable locus. In a population of M = 2N haploid individuals, where random drift occurs according to the Moran model (Moran 1958) and where mutation follows the "infinite alleles" mutation process with mutation rate u, at mutation-drift equilibrium the distribution of allele counts in the population is the Ewens sampling distribution with parameters M and
= 2Mu = 4Nu. Furthermore, the composition of a sample of n allele copies drawn for this population has the Ewens sampling distribution with parameters n,
. The Ewens sampling distribution applies approximately to a wider range of neutral models (for a review of relevant results, see Ewens 1990). The final ingredient for the Bayesian computation is the prior
(
) for the parameter
(see Appendix for details).
In order to estimate the posterior probabilities of interest, we need a method for generating a large number of observations from the posterior distribution
(X'|n'; X) of the composition of a future sample, of size n', conditional on a currently available sample of size n and composition X = (X1,X2, ... ,Xn) (the allele types of these n allele copies) drawn from the same population. From this we can obtain the marginal posterior distribution
(K'(n')|n'; X) of the allelic richness K'(n') of the new sample X'=(X'1, X'2, ... , X'n'). Our method (described in the Appendix) uses a standard acceptancerejection method, together with an urn scheme (Watterson 1976; Hoppe 1984), which generates a sample from the Ewens sampling distribution. This method is easily generalized to deal with situations where we have samples of alleles from multiple populations. For clarity, we will concentrate on the case of 2 populations. Suppose that we have a sample X1 of size n1 from population 1 and a sample X2 of size n2 from population 2. The posterior distributions of K'1(n') and K'2(n') are independent. (That is, we have
(K'1(n'),K'2(n')|n';X1;X2) =
(K'1(n')|n';X1)
(K'2(n')|n';X2).) Therefore, we can sample from the posterior distribution of the difference K'1(n') K'2(n'), simply by drawing K'1(n') from the posterior distribution
(K'1(n')|n';X1) and drawing K'2(n') from
(K'2(n')|n';X2), using the algorithm described above, and then computing the difference K'1(n') K'2(n'). By iterating this procedure, we can obtain a large sample from the posterior distribution
(K'1(n') K'2(n')|n';X1;X2), which can then be plotted as a histogram. We can also compute
1 =
(K'1(n')>K'2(n')|n';X1;X2),
2 =
(K'1(n')<K'2(n')|n';X1;X2), and
3 = 1
1
2 =
(K'1(n') = K'2(n')|n';X1;X2). These probabilities are a convenient summary of the posterior distribution of the difference in allelic richness. Furthermore, these probabilities can be used as the basis of our decision as to which population will yield the future sample (of size n') having the higher allelic richness.
More generally, whenever we have samples from an arbitrary number of populations, we could compute the posterior probability that a future sample from a particular population will have the highest allelic richness. We could also compute the posterior probability that future samples from these populations will have any particular ranking with respect to allelic richness.
Because our model for the composition of the population is the Ewens sampling distribution, where the diversity of the population depends on a single parameter
, we could have based our decision on whether the posterior probability
(
1 >
2X1;X2) is greater than or less than 1/2. However, the present approach, based on the posterior probabilities
1,
2, and
3, is applicable regardless of the model for the composition of the population.
We could make our decision according to whichever is the maximum of the 3 posterior probabilities
1,
2, and
3. Thus, we have Rule 1: If the maximum value is
1, then we predict that population 1 will yield the future sample (of size n') having the higher allelic richness; if the maximum value is
2, then we predict that population 2 will yield the future sample having the higher allelic richness; and if
3 is the maximum, then we remain undecided.
An alternative type of rule involves a critical probability level P. Rule 2: If
1 > P, then we predict that population 1 will yield the future sample (of size n') having the higher allelic richness; if
2 > P, then we predict that population 2 will yield the future sample having the higher allelic richness; otherwise we remain undecided. This rule makes sense provided that we choose the probability level in the interval 0.5
P < 1 because then only one of the probabilities,
1,
2,
3, can exceed P, and any probability which does exceed P is necessarily the maximum of the probabilities.
For comparison with our Bayesian method, we evaluate the performance of a decision rule based on the (Monte Carlo) rarefaction method. This decision rule (from now on Rule 3) is identical to Rule 2, except that the posterior probabilities
1,
2, and
3 are replaced by frequentist probabilities P1, P2, and P3. Let nmin = min{n1, n2}. The frequentist probabilities P1, P2, and P3 are defined as follows: P1 is the probability that a sample of size nmin from population 1 contains more allele types than a sample of the same size from population 2; P2 is the probability that a sample of size nmin from population 1 contains fewer allele types than a sample of the same size from population 2; and P3 = 1 P1 P2 is the probability that a sample of size nmin from population 1 contains the same number of allele types as a sample of the same size from population 2. These probabilities are computed by keeping the smaller of the original samples fixed and generating a sample of the same nmin = min{n1, n2} size from the larger of the original samples, by repeating the process of drawing a sample of size nmin without replacement from the same original sample (of size nmin = min{n1, n2}) a large number of times.
In the special case where n1 = n2 (and hence nmin = nmax), the rarefaction method is not applicable. Here we define P1 = 1 and P2 = 0, if the original sample from population 1 contains more allele types than the original sample from population 2; and we define P1 = 0 and P2 = 0, if the original sample from population 1 contains fewer allele types than the original sample from population 2.
We evaluated the performance of Rules 2 and 3 by generating a pair (X1, X2) of simulated samples (separately for population 1 and 2, respectively) according to the Ewens sampling distribution, with parameters
1 and
2, respectively (with
1 >
2), and sample sizes n1 and n2, respectively. We then applied Rules 2 and 3 to each pair (X1, X2) of simulated samples and recorded the proportion of these simulated data sets for which each rule succeeded and failed. Given that
1 >
2, we say that a "success" occurs when a rule predicts that population 1 will yield the future sample having the higher allelic richness (because population 1 is more likely to yield the sample having the highest allelic richness). On the same basis, we say that a "failure" occurs when a rule predicts that population 2 will yield the future sample having the higher allelic richness. So we do not include those cases where we remained undecided as either successes or failures.
We consider 2 alternative pairs of population parameters:
1 = 10 and
2 = 8 and
1 = 10 and
2 = 4, and a range of different sample sizes n1, n2 (each taking the values 80, 200, and 400). The probability level, in Rules 2 and 3, is varied from P = 0.5 to P = 0.9, with an increment of 0.1. In Rule 2, the probabilities
1,
2,
3 were computed for the case where the new samples are of size n' = 400. The proportion of successes is plotted in Figure 1, and the proportion of failures (as defined above) is plotted in Figure 2. Each point on these plots is based on 1000 simulated data sets (and hence 1000 separate Bayesian computations). Note that in the case of the Ewens sampling distribution, we expect the direction of the inequality
1 <
2 (or
2 <
1) to remain the same for all sample sizes n'. However, this need not be the case for other models of allelic composition.
|
|
First, we consider the situation where the population parameters are
1 = 10 and
2 = 8. Here we would expect our decision procedures to encounter greater difficulties, and this turns out to be the case. The sample sizes n1, n2 have relatively little effect on the proportion of successes and failures. In Figure 1, we see that the proportion of success for Rule 2 (Bayesian) is never very high, and it falls rapidly as the probability level is increased. For P = 0.5, the proportion of success reaches 0.7, but it never reaches as high as 0.8. For P = 0.6, the proportion of success was typically close to 0.6. In Figure 2, we see that the proportion of failure for Rule 2 (Bayesian) is relatively high for low value of the probability level. The proportion of failures also falls rapidly as the probability level is increased. The performance of Rule 3 is remarkably similar to that of Rule 2. (With the obvious exception of those comparisons where both samples are of the same size, where the rarefaction method of resampling is not applicable.)
Next, we turn to the situation where the population parameters are
1 = 10 and
2 = 4. In Figure 1, we see that the proportion of success for Rule 2 (Bayesian) is always high, and from P = 0.5 up to P = 0.8, the proportion of success never falls below 0.8. In Figure 2, we find that the proportion of failure for Rule 2 (Bayesian) is never higher than 0.03. (As a consequence, the values plotted are subject to sampling error.) These are the sort of results that one would hope for in a situation where the true values of the parameters
1,
2 differ by a factor of more than 2, as they do here. Again, the performance of Rule 3 (Rarefaction) is remarkably similar to that of Rule 2.
We would be reluctant to recommend an inference procedure which achieves a proportion of success lower than 0.6 in the case of a comparison where the parameter values differ appreciably. So the results of Figure 1 more or less compel us to recommend using P = 0.5 as the probability level. With this choice, acceptable results are achieved when
1 and
2 are relatively close, whereas excellent results are achieved whenever
1 and
2 differ more strongly, regardless of differences in the sample sizes.
Further information on the performance of Rules 2 and 3, in the case where P = 0.5, is provided in Figures 35, where we present results for the population parameters
1 = 10 and
2 = 8, and the same range of sample sizes n1, n2 (each taking the values 80, 200, and 400). Each histogram in Figures 35 shows the (relative) frequency distribution of K1(n1) K2(n2) = the difference in allelic richness between the simulated samples from the 2 different populations. Each of these histograms is based on the same 1000 simulated data sets as the corresponding graphs in Figures 1 and 2. In each histogram, whereas the column height indicates the proportion of simulated data sets which yielded a particular value for the difference K1(n1) K2(n2), the height of the black area indicates the proportion of these data sets for which Rule 2 (or Rule 3), with P = 0.5, made a successful prediction. From these histograms, we can see that correct predictions can be made even for some data sets where K1(n1) < K2(n2). Recall that we expect the prediction to be more difficult (more uncertain) for those data sets where the samples differ in size, and it is the largest sample which contains the most allele types. In summary, we have to conclude that Rule 2 and Rule 3 have very similar performance and that neither rule performed uniformly better than the other.
|
|
|
The evaluations of Rules 2 and 3 reported here are based on simulated data generated under the Ewens sampling distribution. Recall that our Bayesian method (Rules 2) is based on precisely this model. So, in this respect, it can be argued that our comparison of Bayesian and rarefaction methods favors the Bayesian method. We would expect the rarefaction method to be less model dependent than the Bayesian method. However, we have found that the rarefaction method does not always outperform the Bayesian method when the model departs from the Ewens sampling distribution. For example, in genealogical simulations of a neutral WrightFisher model (using SIMCOAL 2.0, Laval and Excoffier 2004), where the mutation process is a stepwise mutation model, the Bayesian method performed better than the rarefaction method in cases where the population was expanding, whereas the rarefaction method performed better than the Bayesian method in cases where the population was contracting (results not shown).
In our Bayesian computation, we have used the Ewens sampling distribution as a model for the allele composition of a population, mainly because the computation is then straightforward to implement. However, the Ewens sampling distribution may not be appropriate for hypervariable loci which are subject to various forms of balancing selection (such as major histocompatibility complex [MHC] loci and R genes). Even marker loci which have negligible fitness effects may depart substantially from the Ewens sampling distribution as a result of temporal fluctuations in population size, geographic substructure, or departure of the mutation process from the infinite alleles model. There would appear to be little prospect of finding methods for inferring about the allelic richness of larger samples (or the population) from smaller samples, which are in any way robust to the underlying model of population allelic composition.
Here, we have restricted our attention to making predictions about the allelic richness of samples. However, some authors (see, e.g., Comps et al. 2001, and the review by Widmer and Lexer 2001) have chosen to focus on the allelic richness of the population as a criterion for identifying the geographic locations of refugia. The rational for this is the observation that the allelic richness of a population is more sensitive than gene diversity (expected heterozygosity, Nei 1973) to short-lived population bottlenecks (Nei et al. 1975; Cornuet and Luikart 1996). However, although high allelic richness may reflect an absence of population bottlenecks in the past, in other cases it may be the result of secondary contact between previously isolated populations which have diverged.
A related reason for interest in allelic richness is its possible role as a predictor of the adaptive potential of a population (e.g., Petit et al. 1998). The choice of which populations of a species to prioritize for protection is a common dilemma in conservation management. For a particular species, we may want to base the decision on the estimated census population size at each site. Alternatively, we may want to make an assessment of the adaptive potential of each population. This may lead us to consider the additive genetic variance in quantitative traits. However, another aspect of adaptability which is of critical importance is the potential to evolve resistance to pathogens or parasites, and this often involves hypervariable loci such as MHC in vertebrates and R genes in plants (Leister 2004). It has been argued that allelic richness at putative neutral marker loci is a reliable predictor of allelic richness at loci which contribute to adaptation (Bataillon et al. 1996).
It is not obvious that the allelic richness of the population (even at loci of adaptive significance) is really the most relevant criterion. (Some of the rarest alleles may go extinct before they have any chance of contributing to future adaptation.) Furthermore, there is a fundamental statistical problem with the estimation of the allelic richness of a population. Any sample that is small relative to the size of the population can provide very little information about the presence of rare allele types. More generally, the allelic richness of small samples is not a reliable predictor of the allelic richness of larger samples. Hurlbert (1971) pointed this out in the context of species richness: "it is possible that at one sample size, collection A will have a greater sample species richness than collection B, whereas at a larger sample size, collection B will have the greater sample species richness." The same point was made earlier by Yule (1944) in his discussion of the vocabulary richness of authors. So, if we are interested in the adaptive potential of a population, then we may be better advised to estimate some related parameters about which a sample can provide more information.
Although Bayesian methods are in principal capable of taking account of our uncertainty about the composition of the population, they are still limited by the fact that a sample which is small relative to the size of the population can provide very little information about the presence of rare allele types. This suggests that we should shift our attention to a different parameter, such as the following. For any population, we can define the function
(x) as the number of allele types present in the population at frequencies greater than x. This function, defined for all x in the open interval (0, 1), is here referred to as the "cumulative frequency spectrum." Note that we are accumulating from right to left, so at the upper limit we define
(1) = 0 and at the lower limit we have
(0) = K(2N). Now, rather than simply generating the posterior distribution of
(0), we could generate the posterior distribution for
(x) at each value x in some sequence of values across the interval (0, 1). We could represent this visually by identifying the median and the upper and lower quartiles of the posterior distribution of
(x) at each value x and plotting these quantiles against x. For high values of x, we can expect the posterior distribution of
(x) to be concentrated around its true value because even a sample of moderate size can provide information about the frequencies of the most common allele types. However, a sample of size n can provide very little information about the frequencies of alleles which are present at frequencies of order 1/n or smaller. So, for values of x which are of order 1/n or smaller, we can expect the posterior distribution of
(x) to remain as diffuse as the prior distribution. In particular, unless the sample size is of the same order as the size of the population, we should not expect the posterior distribution of
(0) = K(2N) to differ much from its prior distribution. We believe that the cumulative allelic spectrum is a more useful object of inference than allelic richness of the population. The development of such methods based on the cumulative frequency spectrum appears to be a promising area for future research. Finally, we note that because the cumulative frequency spectrum is a parameter of the population, it is difficult to conceive of directly comparable methods based on rarefaction.
The urn algorithm described in the appendix will be made available as a module in the GENETIX software package (www.genetix.univ-montp2.fr/genetix/genetix.htm). Functions for computing the posterior probabilities for pairwise comparisons will also be included.
| Appendix |
|---|
|
|
|---|
In this Appendix, we describe a method for generating a large number of observations from the posterior distribution
(X'|n'; X) of the composition of a future sample, of size n', given an observed sample of size n.
First we recall an urn construction which plays a central role in our algorithm for performing this Bayesian computation. Watterson (1976) and Hoppe (1984) discovered (independently) that a sample from the Ewens sampling distribution can be generated using the following modified version of the classical Polya urn scheme. Imagine an idealized urn containing a number of balls. Each ball has a "label," or "type," except for one special ball, which, for reasons that will soon become apparent, we will call the "mutating ball." For convenience, positive integers are used as labels. When a ball is drawn from this idealized urn, the probability of drawing a particular ball is proportional to the size of the ball. The mutating ball is of size
, whereas all other balls are of size 1.
The urn process is as follows. At each iteration, one ball is drawn from the urn. If the mutating ball is drawn, it is returned to the urn together with a new ball of a new type (i.e., a type not already present in the urn). If any other ball is drawn, it is returned to the urn together with a new ball of the same type. This determines the composition of the urn at the start of the next generation.
If the urn contains n balls in addition to the mutating ball, then the labeling of these balls determines a partition of the integer n. Now suppose that at the first iteration the urn contains 1 ball of type 1 together with the mutating ball. With this initial condition, the composition of the urn at iteration n determines a partition of n, and the distribution of this partition is the Ewens sampling distribution for a sample of size n and parameter
.
This urn scheme suggests a way of sampling K(n) from its marginal distribution P(K(n)|n,
), under the Ewens sampling distribution with sample of size n and parameter
. K(n) can be expressed as a sum of n random variables
|
|
|
|
, the following Monte Carlo method generates independent observations from the exact posterior distribution
(X'|n'; X) of the composition of a future sample, of size n', given an observed sample of size n.- Draw a value
* of the parameter
from the prior distribution
(
);
- Draw a value k* for the allelic richness K(n) from the distribution P(K(n)|n,
*);
- Repeat (1) and (2) until k* = k (the allelic richness of the observed sample) and then retain the parameter value
*;
- Empty the urn, then add the observed sample (of size n and composition X = (X1, X2, ... ,Xn)) to the urn, together with a mutator ball of size
*;
- Iterate the urn process n' times (i.e., until there are n + n' balls, together with the mutator ball, in the urn). The last n' balls to be added to the urn constitute the new sample X' = (X'1, X'2, ..., X'n') from the same population.
To implement step 2, we can use the sequence of n independent Bernoulli trials, as described above.
Each pair (
*, X') generated by the above algorithm is an independent draw from the joint posterior distribution
(
;X'|n';X). Therefore, the corresponding values of K'(n') are independent observations from marginal posterior distribution
(K'(n')|n';X) of the allelic richness of the new sample X' = (X'1,X'2, ... ,X'n').
A convenient choice for the prior distribution of the parameter
is the Gamma distribution
|
|
is the shape parameter and
is the scale parameter (
> 0,
> 0). In the work reported above, we apply the constraint
= 1 +
, together with
> 0, to ensure that the mode is located at 1. We suggest using
= 1.1 and
= 0.1, so that the mode is at 1, the mean is 11, and variance is 110. This ensures that most of the probability mass is distributed over the interval from 0 up to 50, which we believe will usually cover the range of plausible values for parameter
. | Acknowledgments |
|---|
This work was supported by the Centre National de la Recherche BIOINFO initiative. Rothamsted Research receives grant-aided support from the Biotechnology and Biological Sciences Research Council of the United Kingdom.
| Footnotes |
|---|
Corresponding Editor: Bruce Weir
| References |
|---|
|
|
|---|
-
Bataillon TM, David JL, Schoen DJ. (1996) Neutral genetic markers and conservation genetics: simulated germplasm collections. Genetics 144:409417.[Abstract]
Comps B, Gomory D, Letouzey J, Thiébaut B, Petit RJ. (2001) Diverging trends between heterozygosity and allelic richness during postglacial in the European beech. Genetics 157:389397.
Cornuet J-M and Luikart G. (1996) Description and power analysis of 2 tests for detecting recent population bottlenecks from allele frequency data. Genetics 144:20012014.[Abstract]
El Mousadik A and Petit RJ. (1996) High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theor Appl Genet 92:832839.[CrossRef][ISI]
Ewens WJ. (1972) The sampling theory of selectively neutral alleles. Theor Popul Biol 3:87112.[CrossRef][ISI][Medline]
Ewens WJ. (1990) Population genetics theorythe past and the future. In Lessard S (Ed.). Mathematical and statistical developments of evolutionary theory(Kluwer Academic Publishers, Dordrecht, the Netherlands) pp. 177227.
Heck KL, van Belle G, Simberloff D. (1975) Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology 56:14591461.[CrossRef]
Hoppe FM. (1984) Polya-like urns and the Ewens sampling formula. J Math Biol 20:9199.
Hurlbert SH. (1971) The non-concept of species diversity: a critique and alternative parameters. Ecology 52:577586.[CrossRef][ISI]
James FC and Rathbun S. (1981) Rarefaction, relative abundance, and diversity of avian communities. Auk 98:785800.
Kalinowski ST. (2004) Counting alleles with rarefaction: private alleles and hierarchical sampling designs. Conserv Genet 5:539543.[CrossRef]
Laval G and Excoffier L. (2004) SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics 20:24852487.
Leberg PL. (2002) Estimating allelic richness: effects of sample size and bottlenecks. Mol Ecol 11:24452449.[CrossRef][Medline]
Leister D. (2004) Tandem and segmental gene duplication and recombination in the evolution of plant disease resistance genes. Trends Genet 20:116121.[CrossRef][ISI][Medline]
Moran PAP. (1958) Random processes in genetics. Proc Camb Philos Soc 54:6971.
Nei M. (1973) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA 70:33213323.
Nei M, Maruyama T, Chakraborty R. (1975) Bottleneck effect and genetic variability in populations. Evolution 29:110.
Petit RJ, El Mousadik A, Pons O. (1998) Identifying populations for conservation on the basis of genetic markers. Conserv Biol 12:844855.[CrossRef]
Saunders HL. (1968) Marine benthic diversity: a comparative study. Am Nat 102:243282.[CrossRef][ISI]
Schoen DJ and Brown AHD. (1993) Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc Natl Acad Sci USA 90:1062310627.
Watterson GA. (1976) The stationary distribution of the infinitely-many neutral alleles diffusion model. J Appl Probab 13:639651.[CrossRef]
Widmer A and Lexer C. (2001) Glacial refugia: sanctuaries for allelic richness, but not for gene diversity. Trends Ecol Evol 16:267269.[CrossRef][Medline]
Yule GU. (1944) The statistical study of literary vocabulary(Cambridge University Press, Cambridge, UK).
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




