


Journal of Heredity Advance Access published online on February 29, 2008

Journal of Heredity, doi:10.1093/jhered/esm127

© The American Genetic Association. 2008. All rights reserved. For permissions, please email: journals.permissions@oxfordjournals.org.

Computer Notes

PGEToolbox: A Matlab Toolbox for Population Genetics and Evolution

James J. Cai

From the Department of Biology, Stanford University, 371 Serra Mall, Stanford, CA 94305

Address correspondence to James J. Cai at the address above, or e-mail: jamescai{at}stanford.edu.

Assessing genetic diversity within populations is vital for understanding the nature of evolutionary processes at the molecular level. PGEToolbox is a Matlab-based, open-source software package for data analysis in population genetics. The main features of this software are as follows: 1) the capability to handle both DNA sequence polymorphisms and single nucleotide polymorphisms (SNPs), which include genotype and haplotype data; 2) exhaustive population genetic analyses and neutrality tests based on coalescent theory; 3) extensibility and scalability for complex and large genome-wide datasets; 4) simple yet effective graphical user interfaces and sophisticated visualization of data and results. For academic use, PGEToolbox is available free of charge at http://bioinformatics.org/pgetoolbox.


Assessing genetic diversity is vital for understanding the nature of evolutionary processes at the molecular level. Over many years, powerful methods have been developed to analyze genetic data and elucidate the influence of mutation, random genetic drift, migration, and natural selection on genetic diversity. Dedicated computer programs implementing these methods have become essential for extracting the information embedded in such data. The recent advent of cost-efficient genotyping techniques has greatly facilitated the assessment of genetic diversity within a population. Consequently, massive computations are often required to analyze the large-scale genetic data obtained.

PGEToolbox (from Population Genetics and Evolution Toolbox) is a software package written in Matlab for data analysis in molecular population genetics. As a high-performance language for technical computing, Matlab has been increasingly appreciated by biologists for data analysis (e.g., Cai et al. 2005). However, few Matlab functions are available for analyzing genetic sequence variation data (including DNA sequence polymorphisms and single nucleotide polymorphisms [SNPs]). PGEToolbox has been developed to fill this gap, providing functions for data manipulation, population genetic statistics calculation, and neutrality tests. Major statistics and tests implemented in PGEToolbox for DNA sequence polymorphisms are given in Table 1.


Table 1. Major statistics and tests implemented in PGEToolbox for DNA sequence data

 
Most of its functions are available in existing software, but PGEToolbox has several advantages. Existing software can be categorized as either programs or libraries. The former include, for example, DnaSP (Rozas et al. 2003), Genepop (Raymond and Rousset 1995), and Arlequin (Schneider et al. 2000). The latter include libsequence (Thornton 2003), the PopGen module of the BioPerl project, the PAL project (Drummond and Strimmer 2001), and the PopGenLib library of the Bio++ project (Dutheil et al. 2006). Compared with libraries, programs are usually more user friendly because of their developed interfaces and outputs, but they lack flexibility in terms of code customization. PGEToolbox sits on the continuum between program and library. It is highly scalable because its functions can easily be chained in scripts (calling one function after another) to perform an entire job in unattended batch mode. Furthermore, it contains simple yet efficient menu-driven and dialog-driven graphical user interfaces, which hide the complexity of the computations from end users. PGEToolbox is released under an open source license, which allows others to extend and reuse components, enables interoperation via open, published interfaces, and reduces duplication of effort within the community. It runs under all 3 major operating systems: Microsoft Windows, UNIX, and Macintosh. Although PGEToolbox requires a Matlab environment, the dependency has been minimized: PGEToolbox is independent of any other Matlab toolboxes and is backward compatible with an earlier version of Matlab (version 6.5). PGEToolbox might be compatible with free alternatives to Matlab, such as Octave (http://www.gnu.org/software/octave/) and Scilab (http://www.scilab.org/), given some revisions. More recent versions of Matlab include some code and memory efficiency features, such as single-precision arithmetic, memory mapping, and support for 64-bit platforms. When these features are used, PGEToolbox should be able to handle modern genetics datasets on the scale of thousands of individuals and hundreds of thousands of sequences or SNPs.

The calculation of sequence polymorphism statistics is a routine task in molecular population genetics. To do this, PGEToolbox first reads a DNA sequence alignment in FASTA or Phylip format. Several polymorphism statistics, such as the number of segregating sites, the site-frequency spectrum (SFS), and the nucleotide diversity, can then be calculated. For the population mutation parameter, θ = 4Nµ (where N is the effective population size and µ is the per-locus mutation rate per generation), PGEToolbox calculates several common estimates, including θW, based on the number of segregating sites (Watterson 1975); θπ, the mean pairwise difference between nucleotide sequences (Nei 1987); and Fay and Wu's θH (Fay and Wu 2000). PGEToolbox conducts several neutrality tests, such as Tajima's D test (Tajima 1989), Fu and Li's D* and F* tests (Fu and Li 1993), Strobeck's S statistic (Strobeck 1987), Wall's B and Q tests (Wall 1999), Fay and Wu's H test (Fay and Wu 2000) (where the H statistic is normalized according to Zeng et al. [2006]), the Ewens–Watterson homozygosity test (Ewens 1972; Watterson 1978), and Kelly's ZnS test (Kelly 1997). Testing the significance of these statistics requires generating bootstrap samples from a neutral model using a coalescent approach. To do this in Matlab, PGEToolbox incorporates the program ms, which was originally written by Hudson (2002) in C, as a MEX-function (the C interface to Matlab). This code migration supplies PGEToolbox with extensive capabilities for coalescent simulations under various parameter settings. The dialog called coalsimdlg has been developed to assist users in setting up these parameters. The function fst_weir calculates Weir's formulation of Wright's FST (Weir and Cockerham 1983; Weir 1996). PGEToolbox can also analyze patterns of genetic diversity within and between population samples by using the McDonald–Kreitman (MK) test (McDonald and Kreitman 1991) and its 2 extensions (Fay et al. 2001; Smith and Eyre-Walker 2002). In doing so, 2 functions count the numbers of synonymous (Ds) and nonsynonymous (Dn) divergences, and the numbers of synonymous (Ps) and nonsynonymous (Pn) polymorphisms. The MK test can be initiated from 2 functions: the command-line function mktestcmd, and mktestgui, which invokes a pop-up dialog showing the 2 × 2 contingency table. The function sewfww estimates the average proportion of amino acid substitutions driven by positive selection by using the methods of Fay et al. (2001) and Smith and Eyre-Walker (2002). Finally, it is important to note that the current version of PGEToolbox deletes all sites with missing data, which may lead to a loss of information.
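
As a rough illustration of the estimators and the test statistic named above, the following Python sketch computes the number of segregating sites, Watterson's θW, the mean pairwise difference θπ, and Tajima's D from a small alignment. It is not PGEToolbox code and does not reflect its Matlab interface: the function names and the example alignment are hypothetical, the sequences are assumed to be aligned, of equal length, and free of gaps or missing data, and all values are per locus. Significance testing by coalescent simulation is not shown.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Count alignment columns that contain more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differences over all pairs of sequences (theta_pi)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def watterson_theta(seqs):
    """Watterson's estimator: S divided by a1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a1


def tajimas_d(seqs):
    """Tajima's D (Tajima 1989); returns NaN when it is undefined."""
    n = len(seqs)
    s = segregating_sites(seqs)
    # The variance term is zero for n < 4 or when nothing segregates.
    if n < 4 or s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    difference = mean_pairwise_differences(seqs) - s / a1
    return difference / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment: 4 sequences, 5 sites.
example = ["ATGCA", "ATGCT", "ACGCA", "ACGTA"]
print(segregating_sites(example))
print(mean_pairwise_differences(example))
print(watterson_theta(example))
print(tajimas_d(example))
```

A positive D indicates that θπ exceeds θW (an excess of intermediate-frequency variants), and a negative D indicates the reverse; whether a given value is significant depends on the null distribution, which the text above obtains by coalescent simulation.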

PGEToolbox provides a variety of functions for SNP analysis. The graphical user interface for these SNP-related functions is snptool, which adjusts its menu for genotype or haplotype SNP data according to the user's choice. In the genotype data mode, snptool opens an input file in the format specified by either the HapMap (International HapMap Consortium 2003) or the Perlegen (Hinds et al. 2005) project. Alternatively, snptool can download SNP data for all 4 HapMap populations: Yoruba from Ibadan, Nigeria; Japanese from Tokyo; Chinese Han from Beijing; and CEPH individuals from Utah (with northern and western European ancestry). After reading the data, snptool computes the observed and predicted heterozygosities, the minor allele frequency, the P value of the Hardy–Weinberg equilibrium test, the allele and genotype frequencies, the composite likelihood (Nielsen et al. 2005), Tajima's D (Tajima 1989), and Fay and Wu's H (Fay and Wu 2000). A warning message is displayed if a "folded" SFS is mistakenly provided to calculate Fay and Wu's H, which requires the "unfolded" SFS (whose ancestral alleles are usually inferred via parsimony using an outgroup). The snptool uses the expectation-maximization algorithm to estimate the probabilities of haplotypes and calculates linkage disequilibrium (LD) statistics, such as D, D', and R, between pairs of SNPs. The snptool displays results and datasets graphically. Some examples include a pie chart displaying SNP allele and genotype frequencies of the 4 HapMap populations, a plot of the relative positions of SNPs on the chromosome, and a visual genotype view (via the function snp_vgview) presenting the complete raw dataset of individuals' genotypes. In the haplotype mode, snptool calculates the haplotype diversity (Depaulis and Veuille 1998), the haplosimilarity score (Hanchard et al. 2006), the chromosome segment homozygosity (Hayes et al. 2003), the minimum number of recombination events (Rm) (Hudson and Kaplan 1985), the extended haplotype homozygosity (EHH) (Sabeti et al. 2002), and the integrated haplotype score (iHS) (Voight et al. 2006). EHH and iHS are 2 particularly powerful statistics that have been used for detecting recent selection (Sabeti et al. 2002; Voight et al. 2006). EHH is based on the haplotype homozygosity statistic, $HH = \left(\sum_i p_i^2 - 1/n\right) / \left(1 - 1/n\right)$, where $p_i$ is the relative frequency of haplotype $i$ and $n$ is the sample size. For selected core haplotypes, EHH calculates HH in a stepwise manner to determine how LD breaks down with increasing distance from a specified core region. The iHS is based on the differential levels of EHH surrounding an allele compared with the background allele at the same position.
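
The stepwise EHH calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the snptool implementation: the function names are hypothetical, haplotypes are assumed to be phased, equal-length strings of allele codes with no missing data, the core is taken to be a single marker, and the extension runs only toward higher marker indices.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """HH = (sum_i p_i^2 - 1/n) / (1 - 1/n) for a list of hashable haplotypes."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return (sum_p2 - 1.0 / n) / (1.0 - 1.0 / n)


def ehh_decay(haplotypes, core_index, core_allele):
    """HH among carriers of core_allele, extended one marker at a time.

    Element k of the result is HH computed on the segment that spans the
    core marker and the next k markers to its right.
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    decay = []
    for end in range(core_index, len(haplotypes[0])):
        segments = [h[core_index:end + 1] for h in carriers]
        decay.append(haplotype_homozygosity(segments))
    return decay


# Hypothetical phased haplotypes over 4 biallelic SNPs ('0'/'1' alleles).
haps = ["0110", "0110", "0111", "0100", "1010", "1001"]
print(ehh_decay(haps, core_index=0, core_allele="0"))
```

The returned values start at 1 at the core itself and fall as more markers are included and the carriers split into distinct extended haplotypes, which is the LD decay that EHH is meant to capture. Comparing such curves between the two alleles at the same position is the idea behind iHS, which is not computed here.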

In summary, I show the usefulness of Matlab as a powerful and convenient scientific computation language in molecular population genetics. PGEToolbox is a powerful and flexible Matlab toolbox dedicated to the analysis of DNA sequence polymorphisms and SNPs. The current version implements a number of algorithms, methods, and tools and is ready to be tailored or extended to specific tasks and scaled up for exhaustive exploratory analyses of genome-wide data.


    Acknowledgments
 
J.J.C. thanks Dmitri Petrov and Mike Macpherson for very helpful discussions and the anonymous reviewers who provided useful comments.


    Footnotes
 
Corresponding Editor: Robert Wayne

Received August 1, 2007
Accepted December 7, 2007


    References

    Cai JJ, Smith DK, Xia X, Yuen KY. MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics (2005) 6:64.

    Depaulis F, Veuille M. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol Biol Evol (1998) 15:1788–1790.

    Drummond A, Strimmer K. PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics (2001) 17:662–663.

    Dutheil J, Gaillard S, Bazin E, Glemin S, Ranwez V, Galtier N, Belkhir K. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics (2006) 7:188.

    Ewens WJ. The sampling theory of selectively neutral alleles. Theor Popul Biol (1972) 3:87–112.

    Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics (2000) 155:1405–1413.

    Fay JC, Wyckoff GJ, Wu CI. Positive and negative selection on the human genome. Genetics (2001) 158:1227–1234.

    Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics (1997) 147:915–925.

    Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics (1993) 133:693–709.

    Hanchard NA, Rockett KA, Spencer C, Coop G, Pinder M, Jallow M, Kimber M, McVean G, Mott R, Kwiatkowski DP. Screening for recently selected alleles by analysis of human haplotype similarity. Am J Hum Genet (2006) 78:153–159.

    Hayes BJ, Visscher PM, McPartlan HC, Goddard ME. Novel multilocus measure of linkage disequilibrium to estimate past effective population size. Genome Res (2003) 13:635–643.

    Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR. Whole-genome patterns of common DNA variation in three human populations. Science (2005) 307:1072–1079.

    Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.

    Hudson RR, Kaplan NL. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics (1985) 111:147–164.

    International HapMap Consortium. The International HapMap Project. Nature (2003) 426:789–796.

    Kelly JK. A test of neutrality based on interlocus associations. Genetics (1997) 146:1197–1206.

    McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature (1991) 351:652–654.

    Nei M. Molecular evolutionary genetics (1987) New York: Columbia University Press.

    Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C. Genomic scans for selective sweeps using SNP data. Genome Res (2005) 15:1566–1575.

    Ramos-Onsins SE, Rozas J. Statistical properties of new neutrality tests against population growth. Mol Biol Evol (2002) 19:2092–2100.

    Raymond M, Rousset F. GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J Hered (1995) 86:248–249.

    Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics (2003) 19:2496–2497.

    Sabeti PC, Reich DE, Higgins JM, et al. Detecting recent positive selection in the human genome from haplotype structure. Nature (2002) 419:832–837.

    Schneider S, Roessli D, Excoffier L. Arlequin: a software for population genetics data analysis (2000) Genetics and Biometry Laboratory, Department of Anthropology, University of Geneva. http://lgb.unige.ch/arlequin/.

    Smith NG, Eyre-Walker A. Adaptive protein evolution in Drosophila. Nature (2002) 415:1022–1024.

    Strobeck C. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics (1987) 117:149–153.

    Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics (1989) 123:585–595.

    Thornton K. Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics (2003) 19:2325–2327.

    Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol (2006) 4:e72.

    Wall JD. Recombination and the power of statistical tests of neutrality. Genet Res (1999) 74:65–79.

    Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol (1975) 7:256–276.

    Watterson GA. The homozygosity test of neutrality. Genetics (1978) 88:405–417.

    Weir BS. Genetic data analysis II: methods for discrete population genetic data (1996) Sunderland (MA): Sinauer Associates.

    Weir BS, Cockerham CC. Estimating F-Statistics for the analysis of population structure. Evolution (1983) 38:1358–1370.

    Zeng K, Fu YX, Shi S, Wu CI. Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics (2006) 174:1431–1439.

