Doubt & Uncertainty
Doubt, or uncertainty, is a factor to be accounted for in all genetic genealogy interpretations. We can (almost) never be sure that statements we make
are correct. Mutations of DNA are random and it's from the probabilities
that we derive the meanings of the results. Uncertainty is the other
side of the coin from confidence, its "evil twin".
Skepticism -- especially, of one's own conclusions -- is a healthy
scientific attitude. This exploration will help us objectively quantify our doubt.
"Confidence" is a well-recognized statistical concept, often expressed
either in "confidence limits" or "confidence intervals". There are methods
for assessing the confidence to be placed in conclusions drawn from many
types of data distributions: normal, binomial, chi-square, F, Cauchy,
Student t, etc.
- Confidence intervals refer to the middle part of a probability curve's area: How
sure can we be that the true value of an unknown falls between A & B? For
example, we know that in a normal distribution 90% of values fall within
plus or minus 1.65 standard deviations
of the mean (average).
- Confidence limits refer to one end of a probability curve's area: How sure can
we be that A is either less than or greater than B? For example, we know
that a normal distribution has only 5% of values outside 1.65 standard
deviations on each of its ends.
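The two figures quoted above can be checked numerically. This is a minimal sketch using the standard normal distribution from Python's standard library; the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

z = 1.65  # the rounded value used in the text

# Confidence interval: area in the middle, between -z and +z.
middle_area = std_normal.cdf(z) - std_normal.cdf(-z)

# Confidence limit: area in a single tail, beyond +z.
one_tail_area = 1.0 - std_normal.cdf(z)

# The z value that leaves exactly 5% in one tail, for comparison with 1.65.
exact_z = std_normal.inv_cdf(0.95)

print(f"middle area within +/-{z}: {middle_area:.3f}")
print(f"area beyond +{z}: {one_tail_area:.3f}")
print(f"z leaving exactly 5% in one tail: {exact_z:.3f}")
```

The middle area comes out at roughly 0.90 and each tail at roughly 0.05, so 1.65 is a rounding of a value closer to 1.645.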
A "standard deviation" (symbols: s for samples and
σ for populations) is the square root of the "variance"
and the variance is the average of the squares of all deviations from the mean.
The mathematical formula is:
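Written out from the definition just given, the population form would read as follows. The symbols x_i (the individual values), μ (their mean) and N (their count) are notation assumed here for illustration.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
```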
So, confidence tells us whether we can be 90%, 95% or 99% sure of an
inference drawn from statistics.
Very often, it's more important for genetic genealogists to know how
likely it is that they are wrong, than how confident they may be in their
assessments. Uncertainty is the space left by the limits of confidence. If
100% represents perfect confidence and we have attained 90% confidence, then
our uncertainty is 100% - 90% = 10%
This is especially applicable when we're dealing with small data sets, as
is typical in genetic genealogy. If
we're attempting to "triangulate" to an ancestral haplotype from a cluster
of "matching" DNA, we may be dealing with as few as two
reference points and, in only a few instances, as many as 10. We may be
unknowingly encountering what
Bessel and others describe as inherent sample bias.
The mathematical principle is that the variability of the data seen in a small sample
tends to be less than that in the entire population.
Fortunately, Bessel's correction for sample bias gives us a basis to
assess the effect of
small sample bias on variability.
His correction formula is:
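In its usual textbook form, Bessel's correction replaces the divisor N with N - 1 when the mean is itself estimated from the sample. With x̄ standing for the sample mean (a symbol assumed here), the corrected sample standard deviation is:

```latex
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```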
Notice that simple algebra allows us to divide one formula by the other and get this expression,
relating a biased estimate to
its unbiased counterpart
which is always > 1 for N>1 (or, for N=1, undefined)
Subtracting unity (1) from this factor yields a simple formula
to estimate the
uncertainty in our conclusions:
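A sketch of that algebra follows. It assumes the two formulas being divided are the corrected sample standard deviation s (divisor N - 1) and the uncorrected one (divisor N). The labels s_N for the uncorrected estimate and U(N) for the resulting uncertainty are chosen here for illustration.

```latex
\frac{s}{s_N}
  = \sqrt{\frac{\sum_{i}(x_i-\bar{x})^2/(N-1)}{\sum_{i}(x_i-\bar{x})^2/N}}
  = \sqrt{\frac{N}{N-1}} \;>\; 1 \quad (N > 1),
\qquad
U(N) = \sqrt{\frac{N}{N-1}} - 1
```

If the variances rather than the standard deviations are divided, the factor is N/(N-1) and the estimate becomes 1/(N-1). Both forms fall quickly as N grows.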
The behavior of the function is shown in this graph:
Notice that the value starts out large at small values of N and decreases rapidly, then the curve
flattens as the number of reference points approaches 10.
We seldom find clusters of matching DNA in the larger sizes. They are often
The above formula relates primarily to uncertainty with respect to the standard deviation,
or "dispersion", of the data.
However, it serves as a rough guide to uncertainty of inferences because
standard deviations are directly involved in confidence intervals.
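The uncertainty factor can be tabulated to see how it behaves for the cluster sizes discussed. The sketch below assumes the standard-deviation form of the ratio, sqrt(N/(N-1)) - 1. The function name `bessel_uncertainty` is hypothetical. If the variance form 1/(N-1) is intended instead, the numbers are larger but the curve has the same shape.

```python
import math


def bessel_uncertainty(n: int) -> float:
    """Return sqrt(N/(N-1)) - 1 (assumed standard-deviation form)."""
    if n < 2:
        raise ValueError("undefined for N < 2")
    return math.sqrt(n / (n - 1)) - 1.0


# Tabulate the estimate for clusters of two to ten reference points.
table = {n: bessel_uncertainty(n) for n in range(2, 11)}
for n, u in table.items():
    print(f"N={n:2d}  uncertainty ~ {u:.1%}")
```

Under this assumption the estimate is about 41% for two reference points and about 5% for ten.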
Uncertainty in the Mode
(From: Wikipedia.) "The mode of a discrete probability distribution is the value x at which its probability
mass function takes its maximum value. In other words, it is the value that is most likely to be sampled."
That is, it is the value which occurs most often in a set of data.
The mode is
usually more stable and less subject to doubt than
other measures of central
tendency. It can change only when a different value becomes most
frequent and replaces the former modal value with a new one.
Note: We use subscripts to denote the frequencies of values in order. F0
is the highest frequency value -- the
present mode. F1
is the next
highest frequency value, F2 the third highest, etc. The fraction F0/N represents
the ratio of the frequency to group size.
Example: A cluster of seven (7) individuals displays
one (1) with DYS458=16, four (4) with DYS458=17, one (1) with DYS458=18
and one (1) with DYS458=19.
- F0/N = 4/7 = ~0.57
- F1/N = 1/7 = ~0.14
- F2/N = 1/7 = ~0.14
- F3/N = 1/7 = ~0.14
The mode will not be affected
unless additional sampling produces at least three (3) more of any one of DYS458=16, 18 or 19
and no more (0) with DYS458=17, at which point the distribution becomes
bimodal. Four (4) more would be needed for any value other than DYS458=17 to become the new mode.
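The ranked frequencies for this example, and the number of extra results a competitor would need, can be worked out mechanically. This is a minimal sketch; the variable names are illustrative, and it assumes every new result goes to a single competing value and none to the current mode.

```python
from collections import Counter

# DYS458 values for the seven-person cluster in the example.
cluster = [16, 17, 17, 17, 17, 18, 19]

counts = Counter(cluster)
n = len(cluster)

# Frequencies ranked from highest to lowest: F0, F1, F2, ...
ranked = counts.most_common()
mode_value, f0 = ranked[0]
f1 = ranked[1][1]

for rank, (value, freq) in enumerate(ranked):
    print(f"F{rank}/N: DYS458={value}  {freq}/{n} = ~{freq / n:.2f}")

# Extra results needed for the nearest competitor to tie the mode
# (bimodal) or to exceed it outright (new sole mode).
to_tie = f0 - f1
to_exceed = f0 - f1 + 1
print(f"mode: DYS458={mode_value}; tie after {to_tie} more, replaced after {to_exceed} more")
```

A tie leaves two modes; only a strictly higher count displaces DYS458=17 as the single modal value.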
Therefore, the stability of the mode is related to its frequency within the
dataset -- the ratio of its frequency (F0) to the size of the set (N):
F0/N -- and to the frequencies of "competitors" for the mode: F1/N, F2/N,
etc. Where F0 is high in relation to F1, the mode can
change only when F1 (or F2) becomes greater than F0.
Example continued: F0/N (DYS458=17) = ~0.57 and F1/N (DYS458=18) = ~0.14.
Assuming our sample is representative of the ancestral haplotype, the
probability of each additional matching test strengthening the mode
(i.e., F0 --> 5) is 4/7 = ~0.57 and the probability of it raising any one
of the other values and so weakening the mode (i.e., F1 --> 2) is ~0.14.
The probability of the mode being strengthened is four times as great as
that of any one competitor gaining on it: (4/7) / (1/7) = 4.
However, the mode, like other statistics, depends on the data available; it can change with additional data.
A bit of examination will show that there are "strong modes" and
"weak modes". A strong mode exists when F0 is much greater than F1;
a weak mode exists when F0 is only slightly greater than F1.
Our previous example was a strong mode demonstration. Strong modes have
greater probability of retaining their values as additional data is added to the dataset.
Weak mode example: A dataset has ten reference
points (N=10) with one (1) at DYS458=16, four (4) at DYS458=17,
three (3) at DYS458=18 and two (2) at DYS458=19.
- F0/N = 4/10
- F1/N = 3/10
- F2/N = 2/10
- F3/N = 1/10
Now, F0/N (DYS458=17) = 4/10 = 0.4, F1/N (DYS458=18) = 3/10 = 0.3 and F2/N
(DYS458=19) = 2/10 = 0.2.
The probabilities, as N→11, only slightly favor the existing mode
being strengthened, i.e., it is only 1.33 times more probable that F0
--> 5 than that F1 --> 4.
It is an easy task to construct a larger dataset in which the mode has a yet smaller
ratio of F0 to N.
Thus, the mode is not completely free of uncertainty. A weak (low
frequency) mode is
prone to change when unknown data becomes known.
Approximating Modal Strength
Can we derive an indicator for the likelihood of the modal value
remaining stable as new data is added? Assuming we may generalize from our
sample, the first order of approximation would seem to be its relative
frequency, F0/N.
- As a first-order approximation, F0/N gives an inferred probability, P, that one
addition to the data set would have the same value. The probability that it
would have some other value is Q = 1-P.
- Second-, third- and subsequent-level approximations:
- F0/N - F1/N = (F0-F1)/N
gives the probability "advantage" the present mode has over its nearest
competitor.
|| Measure || Strong mode example (N=7) || Weak mode example (N=10)
|| F0/F1 || 4/1 = 4 || 4/3 = ~1.33
|| (F0-F1)/N || (4-1)/7 = 3/7 = ~0.43 || (4-3)/10 = 1/10 = 0.10
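The indicators discussed in this section can be computed for any cluster. The sketch below uses a hypothetical helper, `modal_strength`, and rebuilds the two examples from their stated frequencies (4, 1, 1, 1 for the strong mode and 4, 3, 2, 1 for the weak mode). Which DYS458 value carries each frequency is an assumption and does not affect the results.

```python
from collections import Counter


def modal_strength(values):
    """Return (F0/N, F0/F1, (F0-F1)/N) for a list of observed values.

    Hypothetical helper; assumes at least two distinct values are present.
    """
    ranked = Counter(values).most_common()
    n = len(values)
    f0, f1 = ranked[0][1], ranked[1][1]
    return f0 / n, f0 / f1, (f0 - f1) / n


# Strong-mode example: frequencies 4, 1, 1, 1 (N=7).
strong = [16] + [17] * 4 + [18] + [19]
# Weak-mode example: frequencies 4, 3, 2, 1 (N=10).
weak = [16] + [17] * 4 + [18] * 3 + [19] * 2

for label, data in (("strong", strong), ("weak", weak)):
    p, ratio, advantage = modal_strength(data)
    print(f"{label}: F0/N={p:.2f}  F0/F1={ratio:.2f}  (F0-F1)/N={advantage:.2f}")
```

Both the ratio F0/F1 and the difference (F0-F1)/N shrink toward their minimum as the nearest competitor catches up, which is what separates the weak mode from the strong one here.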