
# Doubt & Uncertainty

Doubt, or uncertainty, is a factor to be accounted for in all genetic genealogy interpretations. We can (almost) never be sure that statements we make are correct. Mutations of DNA are random and it's from the probabilities of mutations that we derive the meanings of the results.  Uncertainty is the other side of the coin from confidence, its "evil twin".

Skepticism -- especially, of one's own conclusions -- is a healthy scientific attitude. This exploration will help us objectively quantify our skepticism.

## Confidence

"Confidence" is a well-recognized statistical concept, often expressed either in "confidence limits" or "confidence intervals". There are methods for assessing the confidence to be placed in conclusions drawn from many types of data distributions: normal, binomial, chi-square, F, Cauchy, Student's t, etc.

• Confidence intervals refer to the middle part of a probability curve's area: How sure can we be that the true value of an unknown falls between A & B? For example, we know that in a normal distribution 90% of values fall within plus or minus 1.65 standard deviations of the mean (average).

• Confidence limits refer to one end of a probability curve's area: How sure can we be that A is either less than or greater than B? For example, we know that a normal distribution has only 5% of values outside 1.65 standard deviations on each of its ends.
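The two figures quoted above (about 90% inside plus or minus 1.65 standard deviations, about 5% beyond each end) can be checked numerically. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

z = 1.65  # number of standard deviations quoted in the text

# Confidence interval: area between -z and +z (the middle of the curve).
central_area = std_normal.cdf(z) - std_normal.cdf(-z)

# Confidence limit: area beyond +z on one end of the curve.
one_tail_area = 1.0 - std_normal.cdf(z)

print(f"Area within +/-{z} SD: {central_area:.3f}")
print(f"Area beyond +{z} SD:  {one_tail_area:.3f}")
```

Both printed values should land close to the rounded figures in the text, since 1.65 is itself a rounded z-value.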

A "standard deviation" (symbols: s for samples and σ for populations) is the square root of the "variance", and the variance is the average of the squares of all deviations from the mean. The mathematical formula is:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where μ is the mean and N is the number of values.

So, confidence tells us whether we can be 90%, 95% or 99% sure of an inference drawn from statistics.
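As a worked illustration of the standard deviation definition, the sketch below computes a population standard deviation step by step and compares it with the standard library's `statistics.pstdev`. The data values are hypothetical and were chosen only to keep the arithmetic simple.

```python
import math
from statistics import pstdev

# Hypothetical data, chosen only to make the arithmetic easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Variance: the average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: the square root of the variance.
sigma = math.sqrt(variance)

print(f"mean={mean}, variance={variance}, sigma={sigma}")

# The hand calculation should agree with the library's population formula.
assert math.isclose(sigma, pstdev(values))
```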

## Uncertainty

Very often, it's more important for genetic genealogists to know how likely it is that they are wrong, than how confident they may be in their assessments. Uncertainty is the space left by the limits of confidence. If 100% represents perfect confidence and we have attained 90% confidence, then our uncertainty is 100% - 90% = 10%

This is especially applicable when we're dealing with small data sets, as is typical in genetic genealogy. If we're attempting to "triangulate" to an ancestral haplotype from a cluster of "matching" DNA, we may be dealing with as few as two reference points and, in only a few instances, as many as 10. We may be unknowingly encountering what Bessel and others describe as inherent sample bias.

The mathematical principle is that the variability of the data seen in a small sample tends to be less than that in the entire population.

### Uncertainty Space

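Fortunately, Bessel's correction for sample bias gives us a basis to assess the effect of small sample bias on variability. His correction formula is:

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

which divides by N - 1 where the uncorrected (biased) formula divides by N. Notice that simple algebra allows us to divide one formula by the other and get this expression, relating a biased estimate to its unbiased counterpart:

$$\frac{s_{\text{unbiased}}}{s_{\text{biased}}} = \sqrt{\frac{N}{N-1}}$$

which is always > 1 for N>1 (or, for N=1, undefined).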
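The ratio can be checked with Python's standard library, assuming the biased estimate divides the sum of squared deviations by N and the Bessel-corrected estimate divides by N - 1. Under that assumption the ratio of the two standard deviations is sqrt(N / (N - 1)). The sample values below are hypothetical.

```python
import math
from statistics import pstdev, stdev

# Hypothetical marker values for a small cluster (N = 5).
sample = [16, 17, 17, 18, 19]
n = len(sample)

biased = pstdev(sample)   # divides by N
unbiased = stdev(sample)  # divides by N - 1 (Bessel's correction)

ratio = unbiased / biased
expected = math.sqrt(n / (n - 1))

print(f"biased={biased:.4f}  unbiased={unbiased:.4f}")
print(f"ratio={ratio:.4f}  sqrt(N/(N-1))={expected:.4f}")
```

The ratio depends only on N, not on the particular values in the sample, which is why it can serve as a general-purpose factor.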

Subtracting unity (1) from this factor yields a simple formula to estimate the uncertainty in our conclusions:

$$U \approx \sqrt{\frac{N}{N-1}} - 1$$

[Graph: uncertainty U versus number of reference points N]

Notice that the value starts out large at small values of N and decreases rapidly, then the curve flattens as the number of reference points approaches 10.

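| N | ~U | N | ~U | N | ~U |
|---|---|---|---|---|---|
| 2 | 41% | 7 | 8% | 15 | 4% |
| 3 | 22% | 8 | 7% | 20 | 3% |
| 4 | 15% | 9 | 6% | 25 | 2% |
| 5 | 12% | 10 | 5% | 30 | 2% |
| 6 | 10% | 12 | 4% | 35 | 1% |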

We seldom find clusters of matching DNA in the larger sizes; they often have 10 or fewer members.

The above formula relates primarily to uncertainty with respect to the standard deviation, or "dispersion" of the data. However, it serves as a rough guide to uncertainty of inferences because standard deviations are directly involved in confidence intervals.
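The tabulated values can be regenerated with a few lines of Python. This sketch assumes the uncertainty estimate is U = sqrt(N / (N - 1)) - 1, that is, Bessel's factor minus one; that assumption reproduces the rounded percentages in the table. The function name `uncertainty` is hypothetical.

```python
import math


def uncertainty(n: int) -> float:
    """Approximate uncertainty for n reference points: sqrt(n / (n - 1)) - 1."""
    if n < 2:
        # The factor is undefined for a single reference point.
        raise ValueError("undefined for fewer than two reference points")
    return math.sqrt(n / (n - 1)) - 1.0


group_sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 35]

# Whole-number percentages, as in the table.
table = {n: round(100 * uncertainty(n)) for n in group_sizes}

for n, pct in table.items():
    print(f"N={n:>2}  ~U={pct}%")
```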

### Uncertainty in the Mode

Definition: (From: Wikipedia.) "The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled." More simply, it is the value which occurs most often in a set of data.

The mode is usually more stable and less subject to doubt than other measures of central tendency. It can change only when a different value becomes most frequent and replaces the former modal value with a new one.

Note: We use subscripts to denote the frequencies of values in order. F0 is the highest frequency value -- the present mode. F1 is the next highest frequency value, F2 the third highest, etc. The fraction F0/N represents the ratio of the modal frequency to group size.

Example: A cluster of seven (7) individuals displays one (1) with DYS458=16, four (4) with DYS458=17, one (1) with DYS458=18 and one (1) with DYS458=19. F0/N = 4/7 = ~0.57, F1/N = 1/7 = ~0.14, F2/N = 1/7 = ~0.14, F3/N = 1/7 = ~0.14

The mode will not be affected unless additional sampling produces at least three (3) more results sharing a single one of the values DYS458=16, 18 or 19 and no more (0) with DYS458=17, at which point the distribution becomes bimodal. Four (4) more would be needed for any value other than DYS458=17 to become the sole new mode.
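The counting in this example can be sketched with `collections.Counter`. The variable names are illustrative, and the calculation assumes that all new results share one competing value while no further results arrive at the current mode. Under that assumption a tie (bimodal distribution) needs F0 - F1 additions and a strict takeover needs one more.

```python
from collections import Counter

# DYS458 values for the seven-member cluster in the example.
cluster = [16, 17, 17, 17, 17, 18, 19]

counts = Counter(cluster)

# Frequencies sorted from highest to lowest: F0, F1, F2, ...
ranked = counts.most_common()
mode_value, f0 = ranked[0]
f1 = ranked[1][1]

# Additions at a single competing value needed to tie the mode (bimodal)
# and to overtake it, assuming no new results at the current mode.
to_tie = f0 - f1
to_overtake = to_tie + 1

print(f"mode={mode_value}, F0={f0}, F1={f1}")
print(f"additions to tie={to_tie}, to overtake={to_overtake}")
```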

Therefore, the stability of the mode is related to its frequency within the dataset -- the ratio of its frequency (F0) to the size of the set (N): F0/N -- and to the frequencies of "competitors" for the mode: F1/N, F2/N, etc. Where F0 is high in relation to F1, the mode can change only when F1 (or F2) becomes greater than F0.

Example continued: The F0/N ratio (DYS458=17) = ~0.57 and the F1/N ratio (DYS458=18) = ~0.14. Assuming our sample is representative of the ancestral haplotype, the probability of each additional matching test strengthening the mode (i.e., F0 --> 5) is 4/7 = ~0.57 and the probability of each of the other values weakening the mode (i.e., F1 --> 2) is ~0.14. The probability of the mode being strengthened is four times as great as that of it being weakened by any one competitor, 0.57 / 0.14 = 4.

However, the mode, like other statistics, depends on the data available; it can change with additional information. A bit of examination will show that there are "strong modes" and "weak modes". A strong mode exists when F0 is much greater than F1; a weak mode exists when F0 is only slightly greater than F1. Our previous example was a strong mode demonstration. Strong modes have a greater probability of retaining their values as additional data is added to their groups.

Weak mode example: A dataset has ten reference points (N=10) with one (1) at DYS458=16, four (4) at DYS458=17, three (3) at DYS458=18 and two (2) at DYS458=19. F0/N = 4/10, F1/N = 3/10, F2/N = 2/10, F3/N = 1/10

Now, F0/N (DYS458=17) = 4/10 = 0.4, F1/N = 3/10 = 0.3 and F2/N = 2/10 = 0.2. The probabilities, as N→11, only slightly favor the existing mode being strengthened, i.e., it is only 1.33 times more probable that F0 --> 5 than that F1 --> 4.

It is an easy task to construct a larger dataset in which the mode has a yet smaller ratio of F0 to N.

Thus, the mode is not completely free of uncertainty. A weak (low frequency) mode is prone to change when unknown data becomes known.

### Approximating Modal Strength

Can we derive an indicator for the likelihood of the modal value remaining stable as new data is added? Assuming we may generalize from our sample, the first order of approximation would seem to be its relative frequency, F0/N.

• As a first-order approximation, F0/N gives an inferred probability, P, that one addition to the data set would have the same value. The probability that it would have some other value is Q = 1-P.

• Second-, third- and subsequent-level approximations:
• F0/N - F1/N = (F0-F1)/N gives the probability "advantage" the present mode has over its nearest competitor.

| Order | Statistic | Strong Mode Example | Weak Mode Example |
|---|---|---|---|
| 1st | F0/F1 | 4/1 = 4 | 4/3 = 1.33 |
| 2nd | (F0-F1)/N | (4-1)/7 = 3/7 = ~0.43 | (4-3)/10 = 1/10 = 0.10 |
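These indicators are simple enough to compute for any cluster. The following is a minimal sketch, not a definitive method: the function name `modal_strength` and the dictionary keys are hypothetical, and the weak-mode dataset is assumed to be one value at 16, four at 17, three at 18 and two at 19, consistent with the frequencies 4, 3, 2 and 1 used in that example.

```python
from collections import Counter


def modal_strength(values):
    """Return F0/N, F0/F1 and (F0-F1)/N for a list of marker values."""
    # Frequencies sorted from highest to lowest: F0, F1, F2, ...
    freqs = sorted(Counter(values).values(), reverse=True)
    n = len(values)
    f0 = freqs[0]
    # With only one distinct value there is no competitor, so F1 = 0.
    f1 = freqs[1] if len(freqs) > 1 else 0
    return {
        "relative_frequency": f0 / n,
        "ratio_to_runner_up": f0 / f1 if f1 else float("inf"),
        "advantage": (f0 - f1) / n,
    }


# Strong mode example (N = 7) and assumed weak mode example (N = 10).
strong = modal_strength([16, 17, 17, 17, 17, 18, 19])
weak = modal_strength([16, 17, 17, 17, 17, 18, 18, 18, 19, 19])

for name, result in (("strong", strong), ("weak", weak)):
    print(name, {key: round(val, 2) for key, val in result.items()})
```

A cluster whose ratio to the runner-up is close to 1, or whose advantage is close to 0, would be a weak mode in the sense used above.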

Updated to 12 Oct 2015