Heterozygosity

Assessing genetic structure using codominant, allelic markers within and among populations
(WAAP analysis), meaning of AE and tips on software-based analyses.
We have looked at the derivations for a number of population genetic parameters (variance-
based and distance measures of population structure) and their strengths and weaknesses in
the face of various complexities of natural populations (e.g., small and fluctuating population
size, variation in the breeding sex ratio). We will now focus on the practicalities of assessing
the genetic structure within and among populations -- what measures are essential for any sort
of reasonably comprehensive assessment of genetic structure, what programs are available for
computing those measures and how do we organize data for analysis?
Here are some of the essential components (adapted from a checklist developed by Jim
Hamrick at U. GA):
I. Total variation over the entire set of populations:

P: Polymorphism (% loci polymorphic -- microsatellites most often have P =
1.0)
A: Alleles per locus; AP (alleles/polymorphic locus for studies with P <1.0)
AE = effective number of alleles
H: heterozygosity -- observed (HO) and expected (HE or D, gene diversity).
(Estimates of variation in mutation rates across loci)
II. Within population variation:
Mean P, A, AE, H. The values in Part I are calculated with all the samples
considered to constitute a single group.
Here they are calculated population by population, then averaged over the set
of populations.
Differences among populations in the above. Do one or more populations
have unusually high or low values for any of the above?
Deviations from Hardy-Weinberg expectations (per locus and population)
Assessment of linkage disequilibrium
Estimates of Ne, effective population size (4*Ne*m)
Relatedness or allele-sharing matrices
III. Among population variation:

FST, GST, RST -- variance measures. Hierarchical, if appropriate.
Major differences in allele frequencies among populations
Patterns of variation: clinal, ecotypic, latitudinal etc.
Assignment tests (how well do individuals match the population in which they
were sampled?)
Genetic distances (Cavalli-Sforza, Nei's 1978 et al.)
Correlation between genetic distance and geographic distance (Mantel tests).
Essentially we are testing for an effect of isolation by distance (IBD effect).
Are populations that are further apart geographically more genetically different?
Estimates of gene flow, effective population size (4*Ne*m)
Tree-building, phylogenetic approaches
Assessment of whether partitions exist in the data (Bayesian approaches, tree-
building analyses)
A note on the calculation and uses of AE (effective number of alleles)
AE is the effective number of alleles (at the level of the OTUs we are examining). Verbally,
this measure is the number of equally frequent alleles it would take to achieve a given level
of gene diversity. That is, it allows us to compare populations where the number and
distributions of alleles differ drastically. The formula is:
A_E = \frac{1}{r} \sum_{j=1}^{r} \frac{1}{1 - D_j} Eqn 1
where Dj is the gene diversity of the jth of r loci. Note that we calculate the OTU-level AE by
averaging over the AE calculated locus-by-locus rather than by calculating a mean gene
diversity and then calculating AE from that. The graph below shows why: AE is a nonlinear
function of the gene diversity (Hexp), which brings into play Jensen’s inequality [the
expectation of a function ≠ the function of the expectations for nonlinear curves; see Ruel and
Ayres (1999)]. Here, because the curve is concave up, the AE we compute will be greater than
if we calculated it from the overall gene diversity.
Fig. 1. Effective number of alleles, AE, as a function of the gene diversity (D or Hexp). The
nonlinear relationship brings into play Jensen’s inequality. Note that most of the "action"
happens for D in the range 0.5 to 0.9 (AE goes from 2 to 10).
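The effect of Jensen's inequality on AE can be checked numerically. The following is a minimal sketch, assuming three hypothetical per-locus gene diversities (the values and the function names are illustrative, not from a real data set). It compares the locus-by-locus average of Eqn 1 with the AE computed from the mean gene diversity.

```python
def effective_alleles(d):
    """Effective number of alleles for one locus with gene diversity d (0 <= d < 1)."""
    return 1.0 / (1.0 - d)


def mean_effective_alleles(diversities):
    """OTU-level AE: the average of the per-locus AE values, as in Eqn 1."""
    return sum(effective_alleles(d) for d in diversities) / len(diversities)


# Hypothetical gene diversities for three loci (illustrative values only).
diversities = [0.5, 0.8, 0.9]

# Route 1: compute AE locus by locus, then average.
ae_locus_by_locus = mean_effective_alleles(diversities)

# Route 2: average the gene diversities first, then convert to AE.
mean_diversity = sum(diversities) / len(diversities)
ae_from_mean = effective_alleles(mean_diversity)

print(round(ae_locus_by_locus, 3))  # 5.667
print(round(ae_from_mean, 3))       # 3.75
```

Because the curve is concave up, the first route gives the larger value, and the gap grows as the per-locus diversities become more unequal.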
The meaning of AE. Say we have two populations (or species, or whatever our OTU
is) with the same number of total alleles, but with very different distributions of allele
frequencies. We would like to be able to assess the effective number of alleles as a corollary
to the expected heterozygosity. Remember that, for any given number of alleles, the expected
heterozygosity (gene diversity) is highest when all the allele frequencies are equal (look at
Fig. 5.1 in the web notes). Simply reverse the logic. When the heterozygosity is high (the
peak of the curve in Fig. 5.1) we will have the highest effective number of alleles. For a
heterozygosity of 0.85 we will have, effectively, 6.7 alleles {formula is AE= 1/(1-Hexp)}. If a
locus has 8 total alleles (meaning a maximum possible Hexp of 0.875), but the Hexp is only 0.6,
the effective number of alleles will be only 2.5. This tells us that we have a set of alleles with
very different frequencies. Alleles with frequencies away from the even “average” contribute
very little to the effective number of alleles. When will the effective number of alleles be the
same as the actual number of alleles? At the maximum gene diversity (peak of the curve).
When will it be at a minimum (near 1)? When one allele (the only real contributor to the
effective allele number) dominates the allele frequencies and all the others are very rare.
Imagine that one OTU has 10 total alleles, another just 4; they could have the same effective
number of alleles, if the allele frequencies are very unbalanced in the first case and much
more balanced in the second case. Because of the reciprocal nature of the formula, if the
OTUs have the same AE, they will have the same Hexp. That is, if AE1 = AE2, then Hexp1 =
Hexp2.
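The 8-allele example can be reproduced from allele frequencies directly. This is a sketch under stated assumptions: the unbalanced frequency vector is hypothetical, chosen only so that Hexp comes out near 0.6, and the function names are illustrative. It uses the fact that AE = 1/(1 - Hexp) is the same thing as one over the sum of squared allele frequencies.

```python
def gene_diversity(freqs):
    """Expected heterozygosity: one minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def effective_alleles(freqs):
    """Effective number of alleles, AE = 1 / (1 - Hexp) = 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs)


# Eight equally frequent alleles: Hexp is at its maximum and AE equals the allele count.
balanced = [0.125] * 8

# Hypothetical unbalanced locus, also with eight alleles, dominated by one allele.
unbalanced = [0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02]

for name, freqs in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(name, round(gene_diversity(freqs), 3), round(effective_alleles(freqs), 2))
```

Both loci have eight alleles, but the unbalanced one behaves as if it had fewer than three.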
Lecture 3. Population Genetics I.

Introduction to Population Genetics
Required readings: Avise text, pp. 248-257, Gillespie book Chapters 1, 2 and 5.
Population genetics is the study of Mendel’s laws, the Hardy-Weinberg principle and other
genetic principles as they apply to entire populations of organisms. Population genetics
describes genetic variation in populations, and determines, by observation, experiment and
theory, how that variation changes over time and space. In other words, how much variation
exists in natural populations, and how can we explain variation in terms of origin,
maintenance, and evolutionary importance?
What’s useful about population genetics?
• Management issues. We can answer questions about patterns and trends in
genetic variability, gene flow (are populations genetically connected?), patterns of
dispersal (do males or females tend to disperse further?), Ne (effective population size,
which we will address below), fluctuations in population size (called "bottlenecks"
and also addressed below), forensics applications (legal cases -- does the carcass in
the suspect’s freezer match the poached animal found by a warden?), and population
size estimation (using genetic mark-recapture techniques).
• Conservation of threatened and endangered species. Assessment of genetic
variability to prevent inbreeding depression (useful only for captive breeding
programs aimed at highly valued, charismatic organisms such as black-footed ferrets),
recognition of threatened and endangered species (is this candidate "species" actually
distinct from a more widespread common form found elsewhere?), and molecular
forensics for law enforcement (is this animal part from an endangered ferret or a legal
mink?).
• Evolutionary understanding. Population genetics provides a coherent framework for
understanding how biodiversity is created and maintained. Population genetics helps
explain the spread of traits under the influence of natural selection (a huge topic,
which I will largely ignore).
Hardy-Weinberg Principle
The Hardy-Weinberg principle (and its predicted equilibrium) is the cornerstone of
population genetics. Developed independently by Godfrey H. Hardy and Wilhelm Weinberg in
the early 1900s, the Hardy-Weinberg principle is a model that relates allele frequencies to
genotype frequencies. Like most models, Hardy-Weinberg is a simplification of real world
complexities -- but it has amazing explanatory power nonetheless.
Remember (memorize) the five major assumptions that lead to a Hardy-Weinberg
equilibrium:
• Random mating (no non-random mating)
• Infinite population size (= no genetic drift)
• No mutation
• No genetic migration (permanent movement of alleles from one
population to another, usually by dispersal of individuals)
• No natural selection (or sexual selection)
{Three additional assumptions are that the organisms are diploid, reproduce sexually and
have non-overlapping generations}.
Violations of any of the five major assumptions are the primary forces that drive
evolutionary change.
Remember that an allele is a variant form of a gene (piece of DNA) at a single locus (Latin
for "place", so we are referring to a particular stretch -- for example a stretch of 275 base
pairs on Chromosome 13). An allele frequency (geneticists call it "gene frequency") is
therefore a measure of the commonness of an allele in a population (the proportion of a
specific allele in a population -- how common is the A ["big A"] allele, or the a ["little a"]
allele). A genotype is the specific allele composition for a certain locus or set of loci (Aa, AA,
or AaBBcc for several loci vs. a second genotype AabbCc). Genotype frequency is a measure
of the commonness of a genotype in a population; i.e., the proportion of a specific genotype
in a population. Two major terms are important in discussing genotypes: homozygote and
heterozygote. A homozygote has two copies of the same allele (e.g., AA or bb). A
heterozygote has two different alleles at a given locus (e.g., Aa or Dd). Because the allele and
genotype frequencies are proportions they always sum to 1.0, if we have included all the
possible variants.
Allele frequencies:
p + q = 1 Eqn 3.1
Expected genotype frequencies:
p^2 + 2pq + q^2 = 1 Eqn 3.2
The possible range for an allele frequency or genotype frequency therefore lies between zero
and one, with zero meaning complete absence of that allele or genotype from the population
(no individual in the population carries that allele or genotype); a one means complete
fixation of the allele or genotype (fixation means that every individual in the population is
homozygous for the allele -- i.e., has the same genotype at that locus).
With the five assumptions given above, one can calculate the genotype frequencies for a gene
with two alleles (A and a). The frequency of homozygous genotype AA is the probability of
one allele A being in combination with another allele A. The expected frequency is simply the
product of the separate allele frequencies. We will use the term p to refer to the frequency of
allele A:
Frequency of AA = p^2 Eqn 3.3

The frequency of heterozygous genotype Aa is the probability of allele A being in
combination with allele a. Note that there are two possible ways to get those combinations --
A from Dad and a from Mom, or vice versa (examine Fig. 3.1 below).
Frequency of Aa = 2pq Eqn 3.4
The frequency of homozygous genotype aa is the probability of one allele a in combination
with another allele a.
Frequency of aa = q^2 Eqn 3.5
Fig. 3.1. Diagram of Hardy-Weinberg genotype proportions. Given a
locus with two alleles designated A and a that occur with frequencies p
and q, the chart shows the genotype frequencies (p^2, 2pq, and q^2) as
differently colored areas. Note that the heterozygotes (blue + yellow =
green) can be formed in two different ways (in terms of combination
theory, this means order is not important). Extending this logic and its
implications to multiple alleles and multiple loci provides the basis for
much of the core theory of population genetics.
Example: if p = 0.75 and q = 0.25 we can use Eqns 3.3, 3.4, and 3.5 to calculate the expected
genotype frequencies.
AA = p^2 = 0.75 * 0.75 = 0.5625
Aa = 2pq = 2 * 0.75 * 0.25 = 0.375
aa = q^2 = 0.25 * 0.25 = 0.0625 Eqns 3.6
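The same arithmetic as a short function. This is a minimal sketch for a two-allele locus; the function name `hardy_weinberg` is illustrative.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for a two-allele locus, given the frequency p of allele A."""
    q = 1.0 - p  # Eqn 3.1: p + q = 1
    return {
        "AA": p * p,      # Eqn 3.3
        "Aa": 2 * p * q,  # Eqn 3.4: two ways to form a heterozygote
        "aa": q * q,      # Eqn 3.5
    }


expected = hardy_weinberg(0.75)
print(expected)                 # {'AA': 0.5625, 'Aa': 0.375, 'aa': 0.0625}
print(sum(expected.values()))   # 1.0, as required by Eqn 3.2
```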
The values we have just calculated are EXPECTED genotype frequencies IF the Hardy-
Weinberg assumptions hold. We now turn to how we could check that from actual
OBSERVED genotypic data (such as the microsatellite data for Wyoming black bears). In
order to calculate allele frequencies all we need are the observed genotype frequencies. [No
assumptions needed about the five forces, but what statistical requirement/assumption do we
need to have in place?]
p = p^2 + (2pq/2) and q = q^2 + (2pq/2) Eqn 3.7
[In Eqn 3.7, p^2, 2pq and q^2 stand for the observed frequencies of AA, Aa and aa.]
Let's look at an example from the beginning. We will examine a population of trout with a di-
repeat microsatellite marker that has two alleles, 120 and 122. For simplicity, let’s call allele
120 A and allele 122 a. We genotype 100 individuals and find genotype frequencies of AA =
0.25, Aa = 0.5, and aa = 0.25 (check that when summed these genotype frequencies add to
one). We ask the question of whether this population is in Hardy-Weinberg equilibrium. We
first need to calculate the p and q (allele frequencies of A and a; note that the A and a are
names for the alleles themselves, the p and q refer to the frequencies of those alleles). We
calculate the frequencies using Eqn 3.7.
p = p^2 + (2pq/2) = 0.25 + (0.5/2) = 0.5
q = q^2 + (2pq/2) = 0.25 + (0.5/2) = 0.5 Eqns 3.8
We see that the allele frequencies sum to one, as required by Eqn 3.1. Using the allele
frequencies, we then calculate the expected genotype frequencies using Eqns 3.3, 3.4, and
3.5.
AA = p^2 = 0.5 * 0.5 = 0.25
Aa = 2pq = 2 * 0.5 * 0.5 = 0.5
aa = q^2 = 0.5 * 0.5 = 0.25 Eqns 3.9
The expected genotype frequencies are the same as the observed genotype frequencies (from the
microsatellite data). This tells us that our population is in Hardy-Weinberg equilibrium. If the
expected genotype frequencies calculated from the allele frequencies were not the same as
the observed genotype frequencies our population would not be in Hardy-Weinberg
equilibrium -- we assess whether the difference is statistically significant using a chi-square
test, as we will see shortly. [Note that statistical significance is not a guarantee of biological
significance].
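The trout example, and the chi-square comparison mentioned above, can be sketched in a few lines. The sketch works from genotype counts rather than frequencies, because the chi-square statistic needs counts. The second data set (40, 20, 40) is hypothetical and is included only to show a clear heterozygote deficit. The function names are illustrative, and the code assumes both alleles are present (otherwise an expected count of zero would cause a division by zero).

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Allele frequencies (p, q) from observed genotype counts; each Aa carries one A."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    return p, 1.0 - p


def hw_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed counts with Hardy-Weinberg expectations."""
    n = n_AA + n_Aa + n_aa
    p, q = allele_frequencies(n_AA, n_Aa, n_aa)
    observed = [n_AA, n_Aa, n_aa]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Trout example: 100 fish with genotype frequencies 0.25, 0.5, 0.25.
print(allele_frequencies(25, 50, 25))  # (0.5, 0.5)
print(hw_chi_square(25, 50, 25))       # 0.0, a perfect fit

# Hypothetical sample with too few heterozygotes for its allele frequencies.
print(hw_chi_square(40, 20, 40))
```

With two alleles the test has one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency), so a statistic above 3.84 is significant at the 0.05 level.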
The expected frequency distribution of genotypes AA, Aa, and aa in proportions p^2, 2pq and
q^2 respectively is called the Hardy-Weinberg equilibrium. If the population meets the eight
assumptions listed above, then the population will go to the Hardy-Weinberg equilibrium in
the first generation, and remain there. Again, the Hardy-Weinberg principle, with its predicted
equilibrium, is a simple model that serves as a starting point for examining the genetic
structure of populations.
Violating Hardy-Weinberg assumptions
How likely are we to meet the major assumptions of random mating, no drift, no mutation, no
migration, and no natural selection? If we violate the assumptions, how much difference does
it make? Here is a list of processes that violate the Hardy-Weinberg assumptions and some
discussion of each of them. These "big five" forces are the major engines of evolutionary
change. An important point is whether the given force tends to increase or decrease the
genetic variability in populations.
• Non-random mating (tends to reduce genetic variation)
Random mating means that alleles (as carried by the gametes -- eggs or sperm) come together
strictly in proportion to their frequencies in the population as a whole. Example: if p = 0.6
and q = 0.4, then the probability of an Aa heterozygote is 0.48 (the product of the allele
frequencies, plus consideration of the fact that two ways exist to make a heterozygote; see
Fig. 3.1). Situations where the random mating assumption does not hold include:
- Inbreeding -- cases where relatives (e.g., siblings, cousins) have a greater
probability of mating with each other than with other members of the population.
Inbreeding will tend to decrease heterozygosity without affecting allele frequencies.
- Geographic structuring -- in many cases individuals are more likely to mate with
geographically proximate individuals than with more distant individuals.
Geographic structuring is essentially an extended form of inbreeding.
- Positive/negative assortative mating -- in positive assortative mating (usually
called just assortative mating) individuals of a given phenotype or genotype tend to
mate with similar individuals (e.g., A1A1 tend to mate with other A1A1). Assortative
mating will decrease heterozygosity (put like alleles together) without affecting gene
frequencies.
In negative assortative mating (usually called disassortative mating) individuals tend
to mate with dissimilar individuals.
Disassortative mating will tend to increase heterozygosity (put unlike alleles together)
without affecting gene frequencies.
- Rare allele advantage -- in some mating systems a male bearing a rare allele will have
a mating advantage.
Rare allele advantage will tend to increase the frequency of the rare allele and hence
increase heterozygosity.
- Mating system effects -- in a polygynous mating system one or a few males that
obtain a disproportionate share of the matings will be over-represented genetically
(this differs from the rare allele effect mainly in that the male's success is not
dependent on having rare alleles -- any rare alleles he does happen to have, however,
will increase in frequency in the next generation). Variance in mating success can
change both gene frequencies and the level of heterozygosity (up or down will depend
on the genotypes of the successful males relative to the frequencies in the population).
Often, a moderate amount of non-random mating has a negligible impact on our
conclusions about the patterns and causes of genetic variation.
• Random genetic drift (always reduces genetic variation)
The effect of random genetic drift is inversely proportional to population size. Allele
frequencies change because the genes appearing in offspring are not a perfectly representative
sampling of the parental genes (in a finite population). Since drift is a random process,
outcomes of drift must be stated as probabilities. Drift removes genetic variation from the
population at a rate inversely proportional to population size. As population size decreases
the force of drift increases, and vice versa. Drift also affects the probability of survival of new
mutations. The probability that an allele will move to fixation is equal to its frequency in the
population -- an allele with a frequency of 0.2 (20%) has a 20% chance of fixation. New
alleles introduced by mutation almost inevitably begin at low frequencies and have a low
probability of fixation. Drift can lead to the loss of rare alleles and the fixation of common
alleles. If the population is large, however, drift has little effect.
Marble analogy: Think of a jar containing a million marbles in ten different colors. If we
draw a random sample of 500,000 it will almost certainly contain all ten colors in
proportions very similar to the original proportions. If we pick only five marbles, however,
we will definitely have a biased sample (we can’t have picked more than 5 of the 10 colors,
our "alleles" -- any duplicates and we'll have even fewer). Even if we take a sample of 50, we will be
unlikely to maintain the proportions of the original million -- the small sample prevents us
from drawing a representative array. Similarly, drift is inversely proportional to population
size -- large population = minor drift, small population = major drift.
Drift can have major effects on endangered (small, almost by definition) populations. For
other species it can take a long time (thousands, hundreds of thousands or even millions of
years) for drift to have large effects.
Fig. 3.2. Computer simulation of genetic drift. The fate of the A1 allele (with
frequency p, on the Y-axis) is shown in five replicate populations for a time
course of 100 generations (time on the X-axis). Note that if p drops to 0 or
rises to 1.0 then A1 will be lost (0) or reach fixation (1.0). Those frequencies (0
and 1.0) are therefore called "absorbing boundaries". Notice also the jagged
trajectories that often characterize random processes.
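A simulation of this kind can be sketched as follows. The sketch assumes a simple Wright-Fisher scheme (each generation, 2N gene copies are drawn at random from the previous generation's allele frequency); that scheme, the population size of 20 and the seed are assumptions for illustration, not a description of how the figure was produced.

```python
import random


def simulate_drift(p0, n_individuals, generations, rng):
    """Track the frequency of allele A1 under drift alone in a diploid population."""
    two_n = 2 * n_individuals  # number of gene copies in a diploid population
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each gene copy in the next generation is A1 with probability p.
        copies = sum(1 for _ in range(two_n) if rng.random() < p)
        p = copies / two_n
        trajectory.append(p)
    return trajectory


rng = random.Random(1)  # fixed seed so the run is repeatable
runs = [simulate_drift(0.5, 20, 100, rng) for _ in range(5)]

for i, trajectory in enumerate(runs, start=1):
    print(f"population {i}: final frequency of A1 = {trajectory[-1]:.3f}")
```

Once a trajectory reaches 0 or 1.0 it stays there, which is the absorbing-boundary behavior described in the caption. Raising `n_individuals` makes the trajectories visibly smoother.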
• Selection (reduces genetic variation)
Selection is the differential survival and reproduction of phenotypes that are better suited to
the environment or to obtaining mating success. Selection is the evolutionary force
responsible for adaptation to the environment. Selection generally removes genetic variation
from the population (occasionally special circumstances such as "frequency-dependent" or
"balancing" selection can serve as forces maintaining variation). Alleles that confer
advantages in survival or reproduction will tend to be represented in greater proportion in the
next generation. After numerous generations (the time required will depend on the intensity
of selection and the heritability of the trait), the advantageous allele will tend to spread to
fixation. It is sometimes useful (and almost always interesting) to distinguish, as Darwin did,
between natural and sexual selection.
If drift and natural selection tend to reduce genetic variation, what maintains or increases it?
-- Mutation.
• Mutation (increases genetic variation and introduces novel variants)
Mutation is the process that produces a gene or chromosome set differing from the wild-type
(ancestral allele). Mutation restores genetic variation to a population by producing novel
alleles. Mutation is difficult to measure or observe directly, and rates of mutation can vary
between loci. It is usually a weak force and therefore tends not to pull populations very far
from Hardy-Weinberg equilibrium -- over long enough time periods, though, even a weak
force can have major effects (e.g., the erosion of the Grand Canyon). Much of the neutral
theory of genetic variation is based on a calculation of the balance between drift and mutation
as forces of change.
• Genetic Migration (distributes and homogenizes genetic variation)
Genetic migration is the permanent movement of genes from one population into another.
Migration can restore genetic variation into isolated and differentiated populations or reduce
variation among populations when it occurs frequently. Assessing the patterns and
importance of genetic migration (often referred to as "gene flow") is one of the major aims of
population genetics. [Note that this definition of migration is very different from that for the
seasonal back and forth movement of birds, for example, from breeding grounds in the
temperate zone to non-breeding grounds in the tropics. Migration, in that sense may have
little effect on permanent movement of alleles].
Some absolute basics about probability and combination theory:
Much of population genetics involves manipulations of equations that have a base in either
probability theory or combination theory. We saw combination theory in action when we
used the formula for the number of distinct unrooted trees as a function of the number of
OTUs. The basic Hardy-Weinberg equation p^2 + 2pq + q^2 is a probabilistic one (with the
addition that since order is unimportant we account for two ways to get heterozygotes).
Rule 1: If you account for all possible events, the probabilities sum to 1. [e.g., p + q = 1 for a
two-allele system].
Rule 2: The probability that two independent events occur is the product of their individual
probabilities.
[e.g., probability of a homozygote is q*q = q^2].
Punch line: Genetic techniques examine individual variation to discern the emergent
properties of populations and higher taxa. We can examine genetic variation at multiple
scales -- from the level of the individual (e.g., forensics applications) to analysis of higher
taxa in systematic and taxonomic studies. Population genetics integrates a broad spectrum of
process and pattern -- geneticists simplify by including only essential forces in their models
and by making simplifying assumptions that, if violated, do not change the qualitative
conclusions. A traditional first step is to build from the Hardy-Weinberg principle -- despite
its admittedly unrealistic assumptions of random mating, no drift, no mutation, no migration,
and no natural selection. In situations where one or more of these assumptions is clearly
violated in a major way, a variety of more complex models can then be brought to bear on the
problem.
Lecture 4. Population Genetics II
Heterozygosity, HExp (or gene diversity, D)
Heterozygosity is of major interest to students of genetic variation in natural populations. It is
often one of the first "parameters" that one presents in a data set. It can tell us a great deal
about the structure and even history of a population. Just for example, very low
heterozygosities for allozyme loci in cheetahs and black-footed ferrets indicate severe effects
of small population sizes (population bottlenecks or metapopulation dynamics that severely
reduced the level of genetic variation relative to that expected or found in comparable
mammals). High heterozygosity means lots of genetic variability. Low heterozygosity means
little genetic variability. Often, we will compare the observed level of heterozygosity to what
we expect under Hardy-Weinberg equilibrium (HWE). If the observed heterozygosity is
lower than expected, we seek to attribute the discrepancy to forces such as inbreeding. If
heterozygosity is higher than expected, we might suspect an isolate-breaking effect (the
mixing of two previously isolated populations).
Several measures of heterozygosity exist. The value of these measures will range from zero
(no heterozygosity) to nearly 1.0 (for a system with a large number of equally frequent
alleles). We will focus primarily on expected heterozygosity (HE, or gene diversity, D, as
Bruce Weir prefers to call it). The simplest way to calculate it for a single locus is as:
H_E = 1 - \sum_{i=1}^{k} p_i^2 Eqn 4.1
where p_i is the frequency of the ith of k alleles. [Note that p_1, p_2, p_3 etc. may correspond to
what you would normally think of as p, q, r, s etc.]. If we want the gene diversity over several
loci, we need double summation and subscripting as follows:
H_E = 1 - \frac{1}{m} \sum_{l=1}^{m} \sum_{i=1}^{k} p_{li}^2 Eqn 4.2
where the first summation is for the lth ("ellth") of m loci. [Note that we average over the m
loci via the 1/m term]. The second summation is as in Eqn 4.1.
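Eqns 4.1 and 4.2 translate directly into code. A minimal sketch, with hypothetical allele frequencies for three loci and illustrative function names:

```python
def gene_diversity(freqs):
    """Single-locus expected heterozygosity (Eqn 4.1): one minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def mean_gene_diversity(loci):
    """Multi-locus gene diversity (Eqn 4.2): the single-locus values averaged over m loci."""
    return sum(gene_diversity(freqs) for freqs in loci) / len(loci)


# Hypothetical allele frequencies at three loci (each list sums to 1).
loci = [
    [0.5, 0.5],    # two equally frequent alleles
    [0.1] * 10,    # ten equally frequent alleles
    [0.75, 0.25],  # two alleles, unequal frequencies
]

for freqs in loci:
    print(round(gene_diversity(freqs), 4))
print(round(mean_gene_diversity(loci), 4))
```

Averaging the single-locus values is algebraically the same as subtracting the averaged double sum from one, so the function matches Eqn 4.2.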
Why does it work to take the sum of the squared gene frequencies and subtract that from one?
Let’s think back to basic Hardy-Weinberg:
p^2 + 2pq + q^2 = 1 Eqn 4.3
where the heterozygosity is given by 2pq. The rest of the expression (p^2 + q^2) is the
homozygosity. If we want the heterozygosity, we just subtract that from the total. With just
two alleles it isn't as efficient to calculate the heterozygosity by the "one minus the
homozygosity" route. Consider the case, though, of a locus with 6 alleles. It has 21 possible
genotypes -- 6 kinds of homozygotes and 15 kinds of heterozygotes. Writing it out, 6 + 5 + 4
+ 3 + 2 + 1 = 21 = [6*(6+1)]/2 -- this is the formula for the number of unordered pairs of n
things when a thing may be paired with itself, [n(n+1)] / 2. (The 15 heterozygotes alone are the
combinations of six things taken two at a time, [n(n-1)] / 2.) The more alleles, the simpler it
becomes simply to square the gene frequencies and sum them, compared to enumerating all possible
heterozygotes and calculating the (possibly very many) different heterozygote frequencies.
We trade a little inefficiency on two-allele systems for much greater efficiency with multi-
allele systems.
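The two routes to heterozygosity can be compared directly. In this sketch the six allele frequencies are hypothetical and the function names are illustrative. It enumerates every heterozygote class and checks that the total matches the "one minus the homozygosity" shortcut.

```python
from itertools import combinations


def count_genotypes(k):
    """Number of distinct genotypes at a locus with k alleles: k(k+1)/2."""
    return k * (k + 1) // 2


def heterozygosity_by_enumeration(freqs):
    """Sum 2 * p_i * p_j over every unordered pair of different alleles."""
    return sum(2 * p_i * p_j for p_i, p_j in combinations(freqs, 2))


def heterozygosity_by_complement(freqs):
    """One minus the homozygosity (the sum of squared allele frequencies)."""
    return 1.0 - sum(p * p for p in freqs)


# Hypothetical frequencies for a six-allele locus (they sum to 1).
freqs = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]

print(count_genotypes(6))                    # 21 genotypes in total
print(len(list(combinations(range(6), 2))))  # 15 of them are heterozygotes
print(round(heterozygosity_by_enumeration(freqs), 4))
print(round(heterozygosity_by_complement(freqs), 4))
```

The enumeration needs 15 products for six alleles and 190 for twenty, while the shortcut needs only as many squares as there are alleles.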
What does heterozygosity tell us, and what patterns emerge as we go to multi-allelic systems?
Let’s take an example. Say p = q = 0.5; then 2pq = 0.5. The heterozygosity for a two-allele system is
described by a concave down parabola that starts at zero (when p = 0), goes to a maximum of 0.5 at
p = 0.5 and goes back to zero when p = 1. In fact, for any multi-allelic system, heterozygosity
is greatest when
p_1 = p_2 = p_3 = ... = p_k Eqn 4.4
that is, when the allele frequencies are equal. The maximum heterozygosity for a 10-allele
system comes when each allele has a frequency of 0.1 -- D or HE then equals 0.9. Later, we
will see that the simplest way to view FST (a measure of the differentiation of subpopulations)
will be as a function of the difference between the Observed heterozygosity, Ho, and the
Expected heterozygosity, HE, that we have just derived.
Individual’s-eye view of heterozygosity
Here is a way that I like to think of heterozygosity (HE or D). It is the (expected) probability
that an individual will be heterozygous at a given locus (or over the assayed loci for a multi-
locus system). For many human microsatellite loci, for example, HE is often > 0.85, meaning
that you have a > 85% chance of being a heterozygote.
Now that you have a way to calculate gene diversity/expected heterozygosity, you are ready
to calculate F-statistics by the method of:
FIS = (HS - HI) / HS
FST = (HT - HS) / HT
FIT = (HT - HI) / HT Eqns 4.5
as shown in the worked F-statistic web page demo.
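Eqns 4.5 as a small function. The sketch assumes the usual reading of the three heterozygosities, which the notes do not spell out here: HI is the mean observed heterozygosity within subpopulations, HS the mean expected heterozygosity within subpopulations, and HT the expected heterozygosity of the pooled total. The input values are hypothetical and the function name is illustrative.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (FIS, FST, FIT) from the three heterozygosities of Eqns 4.5."""
    f_is = (h_s - h_i) / h_s
    f_st = (h_t - h_s) / h_t
    f_it = (h_t - h_i) / h_t
    return f_is, f_st, f_it


# Hypothetical heterozygosities (illustrative values only).
f_is, f_st, f_it = f_statistics(h_i=0.40, h_s=0.45, h_t=0.50)
print(round(f_is, 4), round(f_st, 4), round(f_it, 4))

# The three statistics are tied together: (1 - FIT) = (1 - FIS) * (1 - FST).
print(abs((1 - f_it) - (1 - f_is) * (1 - f_st)) < 1e-12)
```

This is the uncorrected, heterozygosity-based calculation, so it will not generally match the output of programs that apply sample-size corrections.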
If you run some data through Eqns 4.5 and an analysis program you may ask:
"Why is the FST I calculate with FSTAT (or some other software) different from the one I
calculate using Eqns 4.5?"
Answer: because the analysis programs use more complex algorithms that take into account
such factors as how individuals disperse (island model vs. stepping-stone model vs. lattice
model), the mutation process (infinite alleles model vs. stepwise mutation model) and various
bias adjusters (e.g., taking into account the sample size of the subpopulations sampled).
Lecture 8. Population Genetics VI: Introduction to
microsatellites: from theory to lab. practice.
A. What are microsatellites?
B. What uses do microsatellites serve?
C. How do we develop microsatellite primers?
D. How do we screen DNA with species-specific or heterospecific primers?
E. What data-analysis tools are available?

A. What are microsatellites?
Microsatellites are simple sequence tandem repeats (SSTRs). The repeat units are
generally di-, tri-, tetra- or pentanucleotides. For example, a common repeat motif in
birds is (AC)n, where the two nucleotides A and C are repeated in bead-like fashion a
variable number of times (n could range from 8 to 50). They tend to occur in non-coding
regions of the DNA (this should be fairly obvious for long dinucleotide
repeats) although a few human genetic disorders are caused by (trinucleotide)
microsatellite regions in coding regions. On each side of the repeat unit are flanking
regions that consist of "unordered" DNA. The flanking regions are critical because
they allow us to develop locus-specific primers to amplify the microsatellites with
PCR (polymerase chain reaction). That is, given a stretch of unordered DNA 30-50
base pairs (bp) long, the probability of finding that particular stretch more than once
in the genome becomes vanishingly small (if the four nucleotides occur with equal
probability then the probability of a given 50 bp stretch is 0.25^50). In contrast, a given
repeat unit (say (AC)19) may occur in thousands of places in the genome. We use this
combination of widely occurring repeat units and locus-specific flanking regions as
part of our strategy for finding and developing microsatellite primers. The primers for
PCR will be sequences from these unique flanking regions. By having a forward and a
reverse primer on each side of the microsatellite, we will be able to amplify a fairly
short (100 to 500 bp) locus-specific microsatellite region.
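To put a number on "vanishingly small", the sketch below evaluates 0.25^50 and multiplies it by a genome size. The genome size of three billion base pairs is an assumption (roughly human-sized), and the calculation keeps the equal-frequency, independent-nucleotide simplification used above.

```python
# Probability that a random 50 bp stretch matches one particular sequence,
# assuming the four nucleotides are equally likely and independent.
p_match = 0.25 ** 50

# Assumption: a genome of about three billion base pairs.
genome_size = 3_000_000_000

# Rough expected number of chance matches, treating every position as a possible start.
expected_chance_matches = p_match * genome_size

print(f"{p_match:.2e}")
print(f"{expected_chance_matches:.2e}")
```

Even across a whole genome the expected number of chance matches is far below one, which is why a 30-50 bp flanking sequence can serve as a locus-specific primer site.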
Mutation process: Microsatellites are useful genetic markers because they tend to be
highly polymorphic. It is not uncommon to have human microsatellites with 20 or
more alleles and heterozygosities (Hexp = gene diversity, D) of > 0.85. Why are they
so variable? The reason seems to be that their mutations occur in a fashion very
different from that of "classical" point mutations (where a substitution of one
nucleotide to another occurs, such as a G substituting for a C). The mutation process
in microsatellites occurs through what is known as slippage replication. If we envision
the repeat units (e.g., an AC dinucleotide repeat) as beads on a chain, we can imagine
that during replication two strands could slip relative positions a bit, but still manage
to get the zipper going down the beads. One strand or the other could then be
lengthened or shortened by addition or excision of nucleotides. The result will be a
novel "mutation" that comprises a repeat unit that is one bead longer or shorter than
the original. The idea that adding or subtracting one repeat is likely easier than adding
or subtracting two or more beads is the basis for using the Stepwise Mutation Model
(SMM) as opposed to the Infinite Alleles Model (IAM). An advantage of the SMM
(at least in theory) is that the difference in size then conveys additional information
about the phylogeny of alleles. Under the IAM the only two states are "same" and
"different". Under the SMM we have a potential continuum of different similarities
(same size, similar in size, very different in size). If, however, the SMM does not
hold, then we may be worse off using it -- it may actually be highly misleading. Even
if the underlying mutation process is largely stepwise, it is not difficult to see how
drift might affect the distribution of allele sizes in a way that would almost entirely
invalidate the SMM (visualize this by examining Figs. 6.1 and 6.2 in Lecture 6).
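The contrast between the two models can be sketched in a few lines. The code below is an illustration only, assuming a strict SMM in which each mutation adds or removes exactly one repeat with equal probability. The function names, the lower bound of one repeat and the seed are our own choices, not part of any published model.

```python
import random


def mutate_stepwise(repeats, rng, min_repeats=1):
    """One strict-SMM mutation: gain or lose a single repeat unit."""
    step = rng.choice((-1, 1))
    return max(min_repeats, repeats + step)


def simulate_lineage(start_repeats, n_mutations, seed=0):
    """Follow one allele lineage through a series of stepwise mutations."""
    rng = random.Random(seed)
    sizes = [start_repeats]
    for _ in range(n_mutations):
        sizes.append(mutate_stepwise(sizes[-1], rng))
    return sizes


def iam_state(a, b):
    """Under the IAM, two alleles are only 'same' or 'different'."""
    return "same" if a == b else "different"


def smm_distance(a, b):
    """Under the SMM, the size difference in repeat units carries information."""
    return abs(a - b)


lineage = simulate_lineage(13, 20, seed=1)
print(lineage)
print(iam_state(19, 17), smm_distance(19, 17))
```

Note that a lineage can return to a size it held earlier, which is the source of the homoplasy problem: two alleles of the same size need not share the same mutational history.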
Advantages of microsatellites as genetic markers:
Locus-specific (in contrast to multi-locus markers such as minisatellites or RAPDs)
Codominant (heterozygotes can be distinguished from homozygotes, in
contrast to RAPDs and AFLPs which are "binary, 0/1")
PCR-based (means we need only tiny amounts of tissue; works on highly
degraded or "ancient" DNA)
Highly polymorphic ("hypervariable") -- provides considerable pattern
Useful at a range of scales from individual ID to fine-scale phylogenies
B. What uses do microsatellites serve?
Microsatellites are useful markers at a wide range of scales of analysis. Until recently,
they were the most important tool in mapping genomes -- such as the widely
publicized mapping of the human genome. They serve a role in biomedical diagnosis
as markers for certain disease conditions. That is, certain microsatellite alleles are
associated (through genetic linkage) with certain mutations in coding regions of the
DNA that can cause a variety of medical disorders. They have also become the
primary marker for DNA testing in forensics (court) contexts -- both for human and
wildlife cases (e.g., Evett and Weir, 1998). The reason for this prevalence as a
forensic marker is their high specificity. Microsatellite profiles can be extremely
discriminating (the probability that a randomly chosen individual would match the
crime-scene evidence by chance is less than one in many millions in some cases). In a
biological/evolutionary context they are useful as markers for parentage analysis.
They can also be used to address questions concerning degree of relatedness of
individuals or groups. For captive or endangered species, microsatellites can serve as
tools to evaluate inbreeding levels (FIS). From there we can move up to the genetic
structure of subpopulations and populations (using tools such as F-statistics and
genetic distances). They can be used to assess demographic history (e.g., to look for
evidence of population bottlenecks), to assess effective population size (Ne) and to
assess the magnitude and directionality of gene flow between populations.
Microsatellites provide data suitable for phylogeographic studies that seek to explain
the concordant biogeographic and genetic histories of the floras and faunas of large-
scale regions. They are also useful for fine-scale phylogenies -- up to the level of
closely related species. An overview by Selkoe and Toonen (2006) provides a useful
practical guide to the use of microsatellites as genetic markers.
Limits to utility of microsatellites: Microsatellite DNA is probably rarely useful for
higher-level systematics. That is because the mutation rate is too high. Across highly
divergent taxa two problems arise. First, the microsatellite primer sites may not be
conserved (that is the primers we use for Species A may not even amplify in Species
B). Second, the high mutation rate means that homoplasy becomes much more likely
-- we can no longer safely assume that two alleles identical in state are identical by
descent (from a common, meaning shared not abundant, ancestor). As a concrete
example imagine two species, each with an AC19 allele that occurs at high frequency.
If the populations diverged long ago it becomes increasingly likely that the way those
alleles arose took different pathways (e.g., in one species the AC19 arose from an
ancestor that went from AC18 to AC19 to AC20 then back to AC19; in the other species
the ancestral AC18 went to AC19 and stayed there). Any inferences we make about the
species relationships based on the AC19 similarity would be misleading. The identity
in state does not correspond to the identity by descent that provides (reliable)
phylogenetic signal. A further potential drawback of using microsatellites is that we
tend to have relatively few loci to work with (4-20). In some situations, that raises the
probability of having a bias due to forces such as selection acting on one or more loci
that may give a misleading impression relative to the true pattern of change for the
genome as a whole.
C. How do we develop microsatellite primers?

We are interested in conducting a genetic analysis of Species X using microsatellites,
because we decide that microsatellites will provide the most information per unit
effort and cost. How do we go about developing primers? If someone has developed
primers for a closely related species, those primers will be well worth checking in our
species. If, however, no primers have been developed for related species, we may
need to develop our own. We do so by a sequence of steps:
1) Extract DNA from tissue (wide variety of possible methods depending upon tissue
type)
2) Fragment the genome. Cut our genomic DNA into suitable size fragments with
restriction enzymes. Generally, restriction enzymes that produce mean fragment
sizes in the range of 300-600 bp are the desired goal.
3) Insert. Insert the fragments into plasmids. This step allows cloning of the
fragments -- producing many copies of the 300-600 bp pieces we have inserted in the
plasmids. To get a slightly more detailed idea of how plasmids act as cloning vectors,
look up the boldface terms in the glossary of terms page. pUC19 is a commonly used
plasmid for this sort of analysis. Why pUC19? The restriction sites in pUC19 are
known (so that the ligated DNA fragments can later be cut out) and it replicates well
in a bacterial culture.
4) Plate the plasmids on a nylon membrane.
5) Probe the membrane with labeled oligonucleotides of desirable repeats (e.g., AC10).
6) Culture the positive clones (the plasmid-fragments that bonded with the oligo
probes).
7) Cut the insert out of the plasmids with restriction enzymes and run them out on an
agarose gel.
8) Probe. Use Southern transfers to probe the digest again with labeled oligos. This
serves:
a) to verify the presence of the repeat and
b) to allow us to estimate the size of the insert.
9) Sequence the positive clones that make it through all the above selection steps.
10) Select. Analyze the sequence to check for "good" primer sites and useful repeat
length (generally at least 8 repeats and it is often best to have more -- depending upon
our intended application we may want long pure repeats or we may be interested in
shorter interrupted repeats, which may have lower mutation rates). Criteria that enter
into primer selection include:
a) "compatibility" of the two primers (they can’t be complementary to each other,
because that would cause cross-binding, and they need to have very similar lengths and
melting temperatures),
b) avoidance of stop codes or other sequences that would cause PCR failures,
c) avoidance of primer initiation sites that won’t bind well, avoidance of
palindromes (sequences that have the same sequence from either end) and a
number of others.
d) total amplified product lengths of 100-250 bp, so that they are feasible for
the sequencing gels or automated genotypers we will use for visualization.
e) avoidance of repeats near end of sequenced region. Some of the positive
clones we have sequenced may have good repeat units, but be too close to the
end of the sequence. We then lack enough flanking region with which to
design a primer. That, in part, is why we want fragments of 300-600 bp --
short enough to be feasible for sequencing, but long enough to reduce the
likelihood that the repeat will be a "cliff-hanger."
Several software packages are available that can help in primer selection (Oligo,
Primer!, MacVector).
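Some of the criteria above are easy to screen for with a short script. The sketch below is not how Oligo, Primer! or MacVector work internally. It assumes the simple Wallace rule (2 degrees per A or T, 4 degrees per G or C) as a rough melting-temperature estimate, treats a palindrome as a sequence equal to its own reverse complement, and uses arbitrary illustrative thresholds. All function names and default values are our own.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def is_palindrome(seq):
    """True if the sequence reads the same on both strands."""
    return seq.upper() == reverse_complement(seq)


def three_prime_dimer_risk(primer_a, primer_b, k=4):
    """True if the last k bases of primer_a could pair with a stretch of primer_b."""
    tail = primer_a.upper()[-k:]
    return reverse_complement(tail) in primer_b.upper()


def primers_compatible(fwd, rev, max_tm_diff=4, max_len_diff=3):
    """Screen a primer pair on Tm difference, length difference and cross-binding."""
    return (abs(wallace_tm(fwd) - wallace_tm(rev)) <= max_tm_diff
            and abs(len(fwd) - len(rev)) <= max_len_diff
            and not three_prime_dimer_risk(fwd, rev)
            and not three_prime_dimer_risk(rev, fwd))
```

A screen like this only narrows the field. Whether a primer pair actually amplifies cleanly still has to be established at the bench.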
11) Order the locus-specific primers (generally these will be 20-30 bp sections of the
flanking regions not immediately adjacent to the repeat unit).
Here is an example of a microsatellite sequence for scrub-jays that contains a repeat
unit and forward and reverse primer sites.
SJR3 [FSJ]
GCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCAAGTGTAT
GTGCATACACGTG
CACACACACACACACACACACAGAGGGTGTGCACATGTGCATGCACACT
CCAAGAGACAGTG
CCTAGTAAAGTGTCTCAGCACCATCTGCAGCAAACAGGTTCTGCAAAAA
CCAATCCCAACTGA
TGTTCCCACAGTGACACTGT
From beginning of forward primer to end of reverse primer, the above is 131 bp
Repeat is CA11
The repeat unit is highlighted in red, while the forward and reverse primers are
highlighted in blue and green. We would send out an order for the primer sequences
(in our case we add an additional 19 bp M13 tail, which allows us to attach
fluorescent nucleotides/dNTPs to our amplified product in the PCR). A laser in our
sequencer/automated genotyper then detects the fluorescence, which is how we
visualize the bands that constitute the allelic data we hope to gather and analyze.
Strassmann et al. (1996) has a more detailed run-through of much of this section.
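Locating the repeat in a sequenced clone is a simple pattern search. The following sketch scans the SJR3 sequence given above for tracts of a dinucleotide motif. The minimum of 8 repeats follows the rule of thumb in Step 10. The function name and its return format are our own.

```python
import re

# SJR3 sequence as listed above, with the line breaks joined.
SJR3 = (
    "GCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCAAGTGTAT"
    "GTGCATACACGTG"
    "CACACACACACACACACACACAGAGGGTGTGCACATGTGCATGCACACT"
    "CCAAGAGACAGTG"
    "CCTAGTAAAGTGTCTCAGCACCATCTGCAGCAAACAGGTTCTGCAAAAA"
    "CCAATCCCAACTGA"
    "TGTTCCCACAGTGACACTGT"
)


def find_repeat_tracts(seq, motif, min_repeats=8):
    """Return (start_index, number_of_repeats) for each uninterrupted tract."""
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_repeats))
    return [(m.start(), len(m.group()) // len(motif))
            for m in pattern.finditer(seq)]


for start, n in find_repeat_tracts(SJR3, "CA"):
    print("CA tract of", n, "repeats starting at index", start)
```

The same function can be pointed at other motifs (e.g., "GATA") when screening clones probed with different oligos.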
D. How do we screen DNA with species-specific or heterospecific primers?

Screening existing microsatellite primers has been a major focus of research in my
lab. Past projects include those of Sam Wisely (now on the faculty at Kansas State
University; genetics of black-footed ferrets and other mustelids), Nicole Korfanta
(genetic structure of migratory vs. resident populations of burrowing owls) and Marni
Koopman (genetic structure of Boreal Owls). We may do a quick guided tour of the
laboratory procedures from DNA extraction from tissue (hair, blood, muscle etc.) to
visualizing the amplified DNA on an ABI automated DNA sequencer in the Nucleic
Acid Exploration Facility (NAEF). Here are the basic steps:
1) Extract the DNA. One often begins by somehow breaking up the tissue (e.g., by
grinding in liquid nitrogen). Alternatives for the extraction process include classic
phenol-chloroform extractions, salt-based extractions, and a variety of commercial
kits. We are getting rid of proteins and other non-DNA tissue components in this step.
A typical analysis might include extracting DNA from each of the individuals in a
local population of 30 individuals.
2) Amplify. We add a very small amount of each of our 30 samples of extracted DNA
to a PCR cocktail for amplification in a thermocycler. This is a "magic" step that has
revolutionized molecular biology. We start with almost no DNA and wind up with
enough that we can see it on a gel! Various "cocktail" recipes exist -- they typically
contain the thermophilic bacterial enzyme Taq polymerase (essential), the dNTP mix
(nucleotides that will allow massive replication of our target DNA), magnesium
chloride, and the fluorescently labeled dNTPs (these will bind to the specially added
M13 or T3 tail and light up under the laser and make bands of DNA alleles show up
on the gel).
3) Load. We load our 30 amplified products in separate lanes in a large vertical
polyacrylamide gel. We also load several lanes with a DNA ladder -- known-size
fragments of amplified DNA of known quantity/concentration. A common ladder is
lambda phage cut with restriction enzymes to yield a series of fragments. The newer
capillary sequencers don't use a gel.
4) Run the sequencer. We run the amplified product through the sequencer until all
the alleles have had time to run by the laser, which illuminates the fluorescent
nucleotides and makes bands light up on the gel (or go digital-direct to the computer).
The sequencer generates both an analog image (for older, gel-based sequencers) and
digitally stored data concerning the size of the fragments.
5) Optimize (variations on Steps 2-4). It often takes considerable fiddling to get the
PCR conditions right for a particular combination of primer, DNA, thermocycler and
sequencer. Major variables in optimization include:
temperature (the primer sequence will have a predicted melting temperature but what
actually works may be higher or lower),
the PCR-programmed times for denaturing, annealing and extending steps
magnesium chloride concentrations
Alternative methods of visualization include "hand-built" polyacrylamide sequencing
gels with silver-staining, SYBR Green staining, ethidium bromide staining or
radioactive labeling. Many of these involve nasty chemicals (EtBr) or radioactivity, so
we feel fortunate to be using a relatively clean, safe procedure.
Fig. 8.1. Stylized diagram of an electrophoretic gel for microsatellites. A current
draws amplified DNA down "lanes" in the polyacrylamide gel. The fragments can then be
separated by size (bp = base pairs) and individuals can be genotyped for their allelic
composition (homozygote or heterozygote for one or more alleles). Here the left-hand
lane has a "ladder" of known-size fragments, the second lane has the DNA from one
individual (genotype bc) and the third lane has the DNA from a second individual
(genotype ad). Running multiple loci provides a wealth of genetic information about
individuals, populations or species.
Fig. 8.2. Representative microsatellite and gender probe gel. DNA was amplified by
PCR and run out on a Li-Cor automated sequencer for scoring by fragment size (number
of base pairs). The individuals are WY black bears.
E. How do we analyze the allelic information? For a slightly more detailed description
go to the Genetic analysis page.
You can also download my Word document on Web Genetic software. Luikart and England
(1999) provides an (older) overview of approaches. For use of alternative markers see papers
(mostly from TREE) by Sunnucks (2000), Mueller and Wolfenbarger (1999; AFLP),
Campbell et al. (2003; AFLP) and Brumfield et al. (2003; SNPs - single nucleotide
polymorphisms).
1) Traditional population genetics tools

Heterozygosity (Hobs, Hexp = D)
Hardy-Weinberg equilibrium
Linkage disequilibrium
FST and other F-statistics
Genetic distances (Cavalli-Sforza chord, Nei’s 1972 and 1978 distances)
Estimates of 4Ne and 4Nem. ( for mutation, m for migration)
2) Microsatellite-specific measures (mostly relying on SMM, Stepwise Mutation Models)

(δμ)² (delta mu squared) of Goldstein et al. (1995)
DSW of Shriver et al. (1995)
RST of Slatkin (1995) as implemented by Goodman (1997)
ΦST of Michalakis and Excoffier (1996)
3) Newer phylogeographic and population genetic tools

Coalescent inferences (Beerli and Felsenstein, 1999; Rannala and Mountain,
1997)
Assignment tests (Davies et al., 1999; software DOH.html)
Assessment of whether the population is panmictic or shows distinct partitions
(Pritchard et al., 2000 and program Structure)
Asymmetric migration analyses (Beerli and Felsenstein, 1999)
Comparisons and contrasts with maternally inherited mitochondrial DNA
structure (Piertney et al. 2000; Chesser and Baker 1996).
Assessment of prior bottlenecks.
Lecture 6. Population Genetics IV: Genetic distances --
biological vs. geometric approaches.
Go to web page outlining major aspects of analyzing genetic population structure (WAAP.html)
(some important measures to calculate, very basic intro. to the practicalities of running a few
of the many software choices)
Taxonomy of genetic distance measures.

We began our study of population genetics by developing the concept of hetero- and
homozygosity from Hardy-Weinberg principles. We used a Hardy-Weinberg approach as one
way to get at a measure of subpopulation differentiation in terms of F-statistics. The F-
statistics provide a view of the variance structure of populations, and can provide an overall
comparison of the degree to which populations are structured:
FST = 0 meaning no structure, no differentiation, and
FST = 1 meaning completely differentiated;
FIS = 0 meaning neither inbreeding nor outbreeding (i.e., meeting the random mating Hardy-
Weinberg expectation),
FIS = 1 meaning completely inbred,
FIS = -1 meaning completely outbred.
Go to web page describing how to calculate FST from heterozygosities (FST.html)
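The relationship between heterozygosities and F-statistics can be sketched as follows. This is a minimal illustration, assuming the standard Wright definitions, FIS = (HS - HI)/HS, FST = (HT - HS)/HT and FIT = (HT - HI)/HT, where HI is the mean observed heterozygosity, HS the mean expected heterozygosity within subpopulations and HT the expected heterozygosity of the pooled population. The two subpopulations in the example are invented and assumed to be of equal size.

```python
def expected_het(freqs):
    """Expected heterozygosity (gene diversity) from allele frequencies."""
    return 1 - sum(p * p for p in freqs)


def f_statistics(h_i, h_s, h_t):
    """Wright's F-statistics from the three levels of heterozygosity."""
    return {
        "FIS": (h_s - h_i) / h_s,
        "FST": (h_t - h_s) / h_t,
        "FIT": (h_t - h_i) / h_t,
    }


# Invented example: one biallelic locus, two equal-sized subpopulations.
sub1 = [0.2, 0.8]
sub2 = [0.8, 0.2]
h_s = (expected_het(sub1) + expected_het(sub2)) / 2   # mean within-subpopulation
pooled = [(a + b) / 2 for a, b in zip(sub1, sub2)]    # pooled allele frequencies
h_t = expected_het(pooled)
h_i = h_s  # assume random mating within each subpopulation

print(f_statistics(h_i, h_s, h_t))
```

With random mating inside each subpopulation, FIS is zero and all of the departure from Hardy-Weinberg expectations in the pooled sample shows up as FST.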
F-statistics do not, however, easily allow pairwise comparisons among subpopulations or
populations. That is, we can assess pairwise FST between populations, but those pairwise
"distances" take account only of the data for the two populations concerned, not all the data
simultaneously. We would like a way to quantify the degree to which A differs from B, B
from C, and A from C from the entire pool of data. We can do this in two major ways -- with
or without underlying biological models. The latter (no biological assumptions or model) are
also known as geometric distances. These geometric distances include Rogers’ and Cavalli-
Sforza chord distances. Distance measures that do make biological assumptions include
Reynolds’ and Nei’s distances. Let’s examine each in turn.
1) Distance methods with no biological assumptions. A locus-specific, codominant marker
population genetic data set, such as the bear one you have used for homeworks, consists of a
set of individual- and population-indexed gene frequencies at one or more loci. We can
analyze these data as a set of numbers without making any biological assumptions.
Approaches could include principal components analyses (PCA), Euclidean distances or
somewhat more complex geometric distances. Many of these will allow us to create a sort of
abstract "map" of the populations in one, two, three or more dimensions (obviously, maps
with dimensionalities > 3 are hard to visualize). Some of these maps can be condensed into
matrices of distances. Here’s an example using real microsatellite data for Western Scrub-
Jays (Aphelocoma californica).
Table 6.1. Cavalli-Sforza chord distances for five populations of Western Scrub-Jays,
Aphelocoma californica.
        WOb3    WSp3    WCal    WOoc    WSp2
WOb3    0       0.0332  0.0492  0.0428  0.0466
WSp3    0.0332  0       0.0488  0.0645  0.0449
WCal    0.0492  0.0488  0       0.0617  0.0533
WOoc    0.0428  0.0645  0.0617  0       0.058
WSp2    0.0466  0.0449  0.0533  0.058   0
The table entries are Cavalli-Sforza chord distances (Cavalli-Sforza and Edwards,
1967; described on pp. 163-166 of Weir, 1996) between five jay populations. For
example, the "distance" between population WSp3 and population WOb3 is 0.0332,
which is smaller than the distance of 0.0488 between WSp3 and WCal. The matrix is
symmetrical (A to B = B to A) and has zeros on the diagonal (A to A = 0).
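A distance matrix of this kind is easy to hold as a nested list and to check programmatically. The sketch below enters the values of Table 6.1, confirms the symmetry and zero diagonal just described, and picks out the closest pair of populations. The helper names are ours.

```python
POPS = ["WOb3", "WSp3", "WCal", "WOoc", "WSp2"]

# Cavalli-Sforza chord distances from Table 6.1 (rows and columns in POPS order).
DIST = [
    [0,      0.0332, 0.0492, 0.0428, 0.0466],
    [0.0332, 0,      0.0488, 0.0645, 0.0449],
    [0.0492, 0.0488, 0,      0.0617, 0.0533],
    [0.0428, 0.0645, 0.0617, 0,      0.058],
    [0.0466, 0.0449, 0.0533, 0.058,  0],
]


def is_valid_distance_matrix(matrix):
    """Check for a zero diagonal and symmetry (A to B equals B to A)."""
    n = len(matrix)
    return all(matrix[i][i] == 0 for i in range(n)) and all(
        matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def closest_pair(names, matrix):
    """Return (name_a, name_b, distance) for the smallest off-diagonal entry."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if best is None or matrix[i][j] < best[2]:
                best = (names[i], names[j], matrix[i][j])
    return best


print(is_valid_distance_matrix(DIST))
print(closest_pair(POPS, DIST))
```

Finding the closest pair is also the first step of clustering methods such as UPGMA, which repeatedly join the two nearest groups.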
How did we get these Cavalli-Sforza distances? They are simply a geometric view of
the distances between multi-dimensional points on a hypersphere (a sphere with > 3
dimensions). Say we have two subpopulations S1 and S2 assayed at a single locus
with alleles i = 1 to k. The formal definition is:
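Eqn 6.1 (standard form of the chord distance):
$D = \frac{2}{\pi}\sqrt{2\,(1 - \cos\theta)}$, where $\cos\theta = \sum_{i=1}^{k} \sqrt{p_{i,S1}\, p_{i,S2}}$
That is, we take the square root of the frequency of allele 1 in S1 times that of allele 1
in S2 and repeat and sum that quantity for all k alleles. That gives us cos(θ), which
we can plug into the square-root term on the RHS (right hand side) of Eqn 6.1 above.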
I don’t expect you to use or memorize this -- just to see that it is a purely
numerical/geometric approach. If we were doing it in 3 dimensions it would be akin
to figuring out the straight-line distance from New York to London through the globe
(the chord), rather than the arc along the surface. It can be fairly easily incorporated
into a number-crunching computer program that will produce output like the table of Cavalli-Sforza
distances shown above. Those distances, in matrix form, can then be used as input for
phylogenetic tree-building routines such as the UPGMA, Fitch-Margoliash and
neighbor-joining approaches we used in the homeworks.
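As a rough illustration of how little code the geometric approach requires, here is a single-locus sketch. It assumes the commonly cited form of the chord distance, D = (2/π) sqrt(2(1 - cos θ)) with cos θ equal to the sum over alleles of sqrt(p_i q_i). The 2/π scaling is one convention among several in the literature, so absolute values may differ from those produced by a given program. The function name is ours.

```python
import math


def chord_distance(p, q):
    """Cavalli-Sforza chord distance between two allele-frequency vectors.

    p and q list the frequencies of the same k alleles, in the same order,
    in subpopulations S1 and S2. Each list should sum to 1.
    """
    cos_theta = sum(math.sqrt(a * b) for a, b in zip(p, q))
    cos_theta = min(1.0, cos_theta)  # guard against rounding slightly above 1
    return (2 / math.pi) * math.sqrt(2 * (1 - cos_theta))


s1 = [0.5, 0.3, 0.2]
s2 = [0.4, 0.4, 0.2]
print(chord_distance(s1, s2))
```

Identical frequency vectors give a distance of zero, and two subpopulations that share no alleles give the maximum for this scaling, 2 sqrt(2)/π. For several loci, one simple option is to average the single-locus values.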
The Cavalli-Sforza chord distance was an early measure and is still used (in fact I see
it gaining ground for use with microsatellites). Another geometric distance that was
widely used with allozymes (but I have not seen used with microsatellite data) is
Rogers’ distance (Wright, 1978). One reason the Cavalli-Sforza distance may be in
greater current use is that it was specifically evaluated (and performed well) in
simulations of tree-building algorithms by Takezaki and Nei (1996). [For all we know
Rogers’ distance may perform equally well or better under circumstances that would
apply well to the questions people like me seek to address -- but since no one has
done such a study, people like me will tend to go with one that has a documented
good track record]. A very important part of the robustness of a distance measure is its
performance under a variety of conditions. It is always best if we can compare several
distance measures under conditions in which we know what the answer should be.
Paetkau et al. (1997) evaluate various distance measures potentially useful for
microsatellite analysis of bear populations.
2) Distance methods with biological assumptions. With a little luck (or a lot of hard
work), we know something about the evolutionary forces (most importantly here
mutation and drift, since we assume we are using markers that are not subject to
natural selection) driving genetic change in the system we're interested in. If so, it
seems reasonable to take advantage of that knowledge by incorporating it into
building a distance model. After all, we expect models with greater realism to perform
better (albeit at the cost of greater complexity, usually). Several distance measures
incorporate assumptions about the importance of drift and mutation as forces of
change:
Reynolds’ distance or the "coancestry" distance (Reynolds et al., 1983; see Weir,
1996, p. 167)
Nei’s distance (Nei 1972, 1978)
Models using a stepwise mutation model (SMM) specifically developed for
microsatellites (e.g., (δμ)² [delta mu squared] of Goldstein et al., 1995).
The problem with making assumptions is that violations can cause errors.
Empirically, it appears that many of the stepwise mutation models for microsatellites
do not perform well when analyzing many (most?) data sets, especially those where
small population sizes mean that drift has played at least as large a role as mutation.
Reynolds’ distance, which was derived for allozyme data on small (e.g., vertebrate)
populations, assumes a primary role for drift and is an infinite-alleles model (an allele
can change from any given state into any other given state). Reynolds’ reliance on
"drift only" seemed inappropriate for microsatellites, which have:
a) a mutation rate that appeared clearly much larger than that of allozymes (1
mutation per 1,000 or 10,000 replication events for microsatellites vs. 1 mutation per
1,000,000 replication events for allozymes). [But that may be based on very long
repeats in highly polymorphic human populations].
b) a mutation process that would seemingly not fit the infinite-alleles model because
mutations generally occur in "stepwise" fashion by adding or deleting one of a series
of beads (AC10 goes to AC9 or AC11, where the subscript refers to the number of AC
repeat units).
[See my web page http://www.uwyo.edu/zoology/mcdonald/dna.htm for a quick
overview of microsatellites].
Nevertheless, Reynolds' distance and its neglect of the importance of mutation, may
work better than we would have expected (at least in some species/populations) for
two reasons:
a) small population sizes (= high potential for drift)
b) "missing steps" because drift creates a "chunky" distribution of alleles instead of
the smooth bell curve we would expect under a strict stepwise process.
Fig. 6.1. A microsatellite allele frequency distribution under a strict stepwise
mutation model (SMM). The X-axis shows the number of repeat units (e.g.,
AC8 to AC19), while the Y-axis shows the number of alleles. Starting with
either a 13 or 14 repeat chain as the ancestor, we tend to accumulate more
alleles at sizes close to the starting point because of equal likelihood of
additions or subtractions and because larger changes (a variant of "mutations
of large effect") will tend to be rare (we think).
Fig. 6.2. An allele frequency distribution that has been greatly affected by drift
and may better fit an infinite-alleles model (IAM). Even if the mutations that
generated the original variation did occur in stepwise fashion, drift has
removed some allele sizes (e.g., the 10-repeat category) while randomly
selecting others to be greatly over-represented (e.g., 12, 15 and 17). This sort
of "chunky" distribution may be quite common in many natural populations of
vertebrates (where effective population sizes, Ne, are always small or at least
often fluctuate to low numbers).
Infinite-alleles vs. stepwise mutation models: Infinite-alleles models were the
standard approach for most allozyme analyses because it was difficult or impossible
to predict the state of a mutation from knowledge of the state of its ancestors. That is,
given that we had one allelomorph (protein/enzyme of a given size/charge that
showed a particular electrophoretic banding profile) it was not at all clear how it
related to other allelomorphs. In a stepwise model, in contrast, one assumes that an
allele of a given size has as its ancestor either the allele one repeat unit longer or the
allele one repeat unit shorter.
Genetic diversity
From Wikipedia, the free encyclopedia
Genetic diversity, the level of biodiversity, refers to the total number of genetic
characteristics in the genetic makeup of a species. It is distinguished from genetic variability,
which describes the tendency of genetic characteristics to vary.
Genetic diversity serves as a way for populations to adapt to changing environments. With
more variation, it is more likely that some individuals in a population will possess variations
of alleles that are suited for the environment. Those individuals are more likely to survive to
produce offspring bearing that allele. The population will continue for more generations
because of the success of these individuals.[1]
The academic field of population genetics includes several hypotheses and theories regarding
genetic diversity. The neutral theory of evolution proposes that diversity is the result of the
accumulation of neutral substitutions. Diversifying selection is the hypothesis that two
subpopulations of a species live in different environments that select for different alleles at a
particular locus. This may occur, for instance, if a species has a large range relative to the
mobility of individuals within it. Frequency-dependent selection is the hypothesis that as
alleles become more common, they become more vulnerable. This occurs in host-pathogen
interactions, where a high frequency of a defensive allele among the host means that it is
more likely that a pathogen will spread if it is able to overcome that allele.
Importance of genetic diversity

A 2007 study conducted by the National Science Foundation found that genetic diversity and
biodiversity (in terms of species diversity) are dependent upon each other—that diversity
within a species is necessary to maintain diversity among species, and vice versa. According
to the lead researcher in the study, Dr. Richard Lankau, "If any one type is removed from the
system, the cycle can break down, and the community becomes dominated by a single
species."[2] Genotypic and phenotypic diversity have been found in all species at the protein,
DNA, and organismal levels; in nature this diversity is nonrandom, heavily structured, and
correlated with environmental variation and stress.[3]
The interdependence between genetic and species diversity is delicate. Changes in species
diversity lead to changes in the environment, leading to adaptation of the remaining species.
Changes in genetic diversity, such as in loss of species, lead to a loss of biological
diversity.[1] Loss of genetic diversity in domestic animal populations has also been studied
and attributed to the extension of markets and economic globalization.[4][5]
Survival and adaptation

Genetic diversity plays an important role in the survival and adaptability of a species.[6] When
a population's habitat changes, the population may have to adapt to survive; the ability of the
population to adapt to the changing environment will determine their ability to cope with an
environmental challenge.[7] Variation in the population's gene pool provides variable traits
among the individuals of that population. These variable traits can be selected for, via natural
selection, ultimately leading to an adaptive change in the population, allowing it to survive in
the changed environment. If a population of a species has a very diverse gene pool then there
will be more variety in the traits of individuals of that population and consequently more
traits for natural selection to act upon to select the fittest individuals to survive.
Genetic diversity is essential for a species to evolve. With very little gene variation within the
species, healthy reproduction becomes increasingly difficult, and offspring are more likely to
have problems resulting from inbreeding.[8] The vulnerability of a population to certain types
of diseases can also increase with reduction in genetic diversity.
Agricultural relevance
When humans initially started farming, they used selective breeding to pass on desirable traits
of the crops while omitting the undesirable ones. Selective breeding leads to monocultures:
entire farms of nearly genetically identical plants. Little to no genetic diversity makes crops
extremely susceptible to widespread disease. Bacteria morph and change constantly. When a
disease causing bacterium changes to attack a specific genetic variation, it can easily wipe out
vast quantities of the species. If the genetic variation that the bacterium is best at attacking
happens to be that which humans have selectively bred to use for harvest, the entire crop will
be wiped out.[9]
A very similar occurrence was the cause of the infamous Potato Famine in Ireland. Because
new potato plants do not come from sexual reproduction but rather from pieces of the parent
plant, no genetic diversity is developed and the entire crop is essentially a clone of one
potato, which makes it especially susceptible to an epidemic. In the 1840s, much of
Ireland’s population depended on potatoes for food. They planted mainly the “lumper”
variety of potato, which was susceptible to a rot-causing oomycete called Phytophthora
infestans.[10] This oomycete destroyed the vast majority of the potato crop, and left one
million people to starve to death.
Farm animal biodiversity

In the past 15 years, 190 breeds of farm animals have become extinct and 1,500 are
considered at risk of becoming extinct, out of 7,600 breeds in the Global Databank for Farm
Animal Genetic Resources compiled by the Food and Agriculture Organization. Over the last
five years, 60 breeds of cattle, goats, pigs, horses and poultry have been lost.[11]
Coping with poor genetic diversity

The natural world has several ways of preserving or increasing genetic diversity. Among
oceanic plankton, viruses aid in the genetic shifting process. Ocean viruses, which infect the
plankton, carry genes of other organisms in addition to their own. When a virus containing
the genes of one cell infects another, the genetic makeup of the latter changes. This constant
shift of genetic make-up helps to maintain a healthy population of plankton despite complex
and unpredictable environmental changes.[12]
Cheetahs are a threatened species. Low genetic diversity and the resulting poor sperm quality
have made breeding and survivorship difficult for cheetahs. Moreover, only about 5% of
cheetahs survive to adulthood.[13] However, it has recently been discovered that female
cheetahs can mate with more than one male per litter of cubs. They undergo induced ovulation,
which means that a new egg is produced every time a female mates. By mating with multiple
males, the mother increases the genetic diversity within a single litter of cubs.[14]
Measures of genetic diversity

The genetic diversity of a population can be assessed by some simple measures.
• Gene diversity is sometimes taken as the proportion of polymorphic loci across the genome
(P), although the term more commonly denotes expected heterozygosity (HE).
• Heterozygosity is the fraction of individuals in a population that are heterozygous for
a particular locus.
• Alleles per locus is also used to demonstrate variability.
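The three measures listed above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for dedicated population-genetics software. The genotype table, the locus names and the function names are all hypothetical. A locus is treated as polymorphic here whenever more than one allele is observed, which is an assumption; many studies instead require the most common allele to fall below a frequency cutoff such as 0.95 or 0.99.

```python
from collections import Counter

# Hypothetical diploid genotypes, made up for illustration only:
# genotypes[locus] holds one (allele, allele) pair per sampled individual.
genotypes = {
    "locus1": [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B")],
    "locus2": [("C", "C"), ("C", "C"), ("C", "C"), ("C", "C")],
    "locus3": [("D", "E"), ("E", "F"), ("D", "D"), ("D", "F")],
}


def observed_heterozygosity(pairs):
    """Fraction of individuals carrying two different alleles at one locus."""
    return sum(a != b for a, b in pairs) / len(pairs)


def alleles_per_locus(pairs):
    """Number of distinct alleles observed at one locus."""
    return len(Counter(allele for pair in pairs for allele in pair))


def summarize(genotypes):
    """Return P, mean A and mean observed H across all loci."""
    n_loci = len(genotypes)
    allele_numbers = [alleles_per_locus(p) for p in genotypes.values()]
    het_values = [observed_heterozygosity(p) for p in genotypes.values()]
    return {
        # Assumption: polymorphic means more than one allele was observed.
        "P": sum(n > 1 for n in allele_numbers) / n_loci,
        "A": sum(allele_numbers) / n_loci,
        "Ho": sum(het_values) / n_loci,
    }


if __name__ == "__main__":
    for name, value in summarize(genotypes).items():
        print(f"{name} = {value:.3f}")
```

With this made-up table, two of the three loci are polymorphic, so P is 2/3, and the mean number of alleles per locus is 2. Swapping in a frequency cutoff for the polymorphism test would only change the expression assigned to "P".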
Other measures of diversity

Alternatively, other types of diversity may be assessed for organisms:
• species diversity
• ecological diversity
• morphological diversity
• degeneracy
There are broad correlations between different types of diversity. For example, there is a
close link between vertebrate taxonomic and ecological diversity.

