
# Match Interpretation: Theory & Math

The theory and math underlying the interpretation of Y-DNA matches are, simultaneously, both simple and complex.

The theory is relatively simple; the hardest part is starting with the right questions. The math gets complex quickly because we have many variables to be accounted for.

## Theory

The theory is the probability of multiple independent events (PMIE), combined with the probability of multiple dependent events (PMDE).

• Each time Y-DNA is passed from a father to a son, each marker has an opportunity to change value independently from what's happening with any other marker; these are multiple independent events.  Whether a marker changes or not is a random event described by the rules of PMIE.

• However, each transmission event is affected by the outcomes of prior events. These are dependent events described by the rules of PMDE.

### Fundamental Questions

The hardest part about the theory is picking the right end of the problem to begin with. What are the fundamental questions we need to ask?

1. What are the chances of no mutation in a marker?
2. What are the chances of no mutations in any of many markers?
3. What is the probability that the observed match between two sets of Y-DNA is not due to chance?
4. What is the probability of a common male ancestor within a certain time period?

### Caveat

There are such things as "coincidental matches", which tend to occur most often among the more common haplotypes. The patterns of marker/allele values may be similar, but they don't spring from a common source. The similarity is more likely due to "convergent evolution", in which organisms of different ancestral heritage evolve to like forms.

For example, you might see that you have hundreds of reported close matches at 37 markers. Not only are these too many to follow up, but you are probably unrelated in genealogical time to most of them.

In such instances, SNP testing is recommended to eliminate matches of obviously different DNA inheritance. SNP testing in some depth can tell you who you are less related to.

### Definitions

• Transmission event: The passing of Y-DNA from a father to a son, subject to uncertainty.
• Sample space: The set of all possible outcomes of one or more transmission events.
• Outcome: Whether or not a marker changes.

Whenever we look at Y-DNA that differs, we need to compare it to the probability of it not changing. The probability of something not happening is the complement of its happening, i.e., 1 - "happening". Statisticians use "p(x)" to indicate the probability of an event and "q(x)" for the non-event's probability.

The sum of the probabilities of all possible events is always 1 (or 100%). The basic formulas are:

• p(x) = the probability of outcome x
• q(x) = the probability of all other outcomes
• p(x) + q(x) = 1,   p = 1 - q,   q = 1 - p

## Real World

In the project, we observe that 67/67 matches are rare, even in matches where both donors have excellent documentation of the common male ancestor (CMA). 37/37 matches are a little less rare, and 25/25 matches less rare still. And 12/12 matches are common.

Intuitively, this suggests that the chances of Y-DNA remaining completely unchanged across many markers through multiple transmission events are small.

## Analogies

Here are some tangible examples whose probabilities mirror those we want to consider.

### Buckets of balls

Picture 12 to 67 buckets, each containing 250 to 400 colored balls;  in each bucket, all the balls -- except a red one -- are the same color, blue. We'll go through the buckets taking one ball out of each bucket without looking. If it's not a red ball, we'll replace it; if it is red, we replace the bucket with the same number of balls, but change the colors to one green and the rest red.

Then we repeat the trial, again and again. Every time we draw an "odd ball", we change the bucket with balls of different colors -- yellow, purple, black, white, tangerine, striped, etc.

What this illustrates is that:

• The outcome of each pick from a bucket (a possible mutation) is independent of the others;
• But each trial (set of buckets) depends on the outcomes of the prior trials.

### Rolling Dice

The math is the same as that used to predict the odds of rolling dice, where you can tell each die from the other.

Years ago, I needed to generate random numbers for sampling purposes. To get the most random sets of numbers possible, I used a set of 10-sided dice whose sides were numbered 0 to 9.

But these are some strange dice; some have 250 sides and some have as many as 400 sides. For each die, one side says "change" and all the other sides say "same".

We'll roll 12 to 67 dice at a time, read the top side & record the results. Then we'll roll again and again.
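The dice game can be tried out as a small simulation. The sketch below is only an illustration: the function names, the random choice of 250 to 400 sides per die, the 10 rolls, and the number of simulated games are all assumptions made here, not values from the project.

```python
import random

def stays_unchanged(sides_per_die, n_rolls, rng):
    """Roll every die n_rolls times; True if no die ever shows 'change'."""
    for _ in range(n_rolls):
        for sides in sides_per_die:
            # Exactly one face in `sides` says "change"; the rest say "same".
            if rng.randrange(sides) == 0:
                return False
    return True

def estimate_no_change(n_dice, n_rolls, n_games=2000, seed=1):
    """Estimate the chance that a set of dice shows 'same' on every roll."""
    rng = random.Random(seed)
    # Assumption: each die gets a fixed side count between 250 and 400.
    sides_per_die = [rng.randint(250, 400) for _ in range(n_dice)]
    unchanged = sum(
        stays_unchanged(sides_per_die, n_rolls, rng) for _ in range(n_games)
    )
    return unchanged / n_games

for n_dice in (12, 25, 37, 67):
    print(n_dice, "dice:", estimate_no_change(n_dice, n_rolls=10))
```

With more dice in play, fewer simulated games end with every die still showing "same", which matches the observation that exact matches get rarer as the marker count rises.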

## The Math

The math quickly gets complex for these reasons.

• Many variables: We are considering a minimum of 12 markers (or dice) and up to 67, each one of which is subject to possible change.
• Small frequencies: The probability of no change (same) for each marker is much greater than that of change. (Indicated by only one side of the 250-400 being labeled "change".)
• Multiple transmission events: We'll be rolling the dice many times. The statistical term for this is "trials". Each trial is dependent on the outcomes of previous trials.

### Simplified Example

Let's consider three markers named "A", "B" & "C".  (Three markers is simpler than our real problem.) For not changing, we'll call them "A0", "B0" & "C0"; for changes, we'll call them "Ac", "Bc" & "Cc". In other words, Ac is the same marker as  A0, but changed in value. To state it mathematically,

• p(A0) = q(Ac) = 1 - p(Ac)
• p(B0) = q(Bc) = 1 - p(Bc)
• p(C0) = q(Cc) = 1 - p(Cc)

Some hypothetical values:

• Let p(Ac) = 1/250 = 0.004, p(Bc) = 1/400 = 0.0025 & p(Cc) = 1/300 ≈ 0.0033. (These are in the approximate range of mutation frequencies.)
• With the probabilities we assigned,
p(A0) = 1 - 0.0040 = 0.9960 = 99.60%,
p(B0) = 1 - 0.0025 = 0.9975 = 99.75% &
p(C0) = 1 - 0.0033 = 0.9967 = 99.67%.

### Basic formulas

In these formulas, "TE" stands for transmission event.

#### The probability of either A, B or C (any one or more) changing in one trial (transmission event) is:

p(Ac or Bc or Cc) ≈ p(Ac) + p(Bc) + p(Cc) -- the probabilities of each added together (an approximation that ignores the very small chance of two or more changing at once)
p(Ac or Bc or Cc) ≈ 0.004 + 0.0025 + 0.0033 = 0.0098 = 0.98%

#### The probability of one, but not more than one, changing in 1 TE is

p(Ac Xor Bc Xor Cc) = p(Ac) + p(Bc) + p(Cc) - 2 * { p(Ac&Bc) + p(Ac&Cc) + p(Bc&Cc) } + 3 * p(Ac&Bc&Cc)
(Each pair is subtracted twice because it was counted once under each of its two markers; the triple is then added back three times. We won't continue this further.)

#### The probability of all of them A, B and C changing in 1 TE is:

p(Ac & Bc & Cc) = p(Ac) * p(Bc) * p(Cc) -- the probabilities of each multiplied together
p(Ac & Bc & Cc) = 0.004 * 0.0025 * 0.0033 ≈ 3.33 / 100,000,000 = 3.33 * 10^-8

#### Especially important, the probability of none of them changing in 1 TE:

p(A0 & B0 & C0) = q(Ac or Bc or Cc) = 1 - p(Ac or Bc or Cc)
= 1 - { p(Ac) + p(Bc) + p(Cc) }

= 1 - {0.004 + 0.0025 + 0.0033} = 1 - 0.0098 = 0.9902 = 99.02%
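The one-TE probabilities for the three hypothetical markers can be checked numerically. This sketch assumes the markers change independently of one another and uses the example rates 1/250, 1/400 and 1/300; the variable names are made up for the illustration. It computes the exact values by multiplication so they can be compared with the simple sum.

```python
from math import prod

# Hypothetical per-marker change probabilities from the simplified example.
p_change = {"A": 1 / 250, "B": 1 / 400, "C": 1 / 300}
p_same = {marker: 1 - p for marker, p in p_change.items()}

# None change: every marker must stay the same, so multiply.
p_none = prod(p_same.values())
# At least one changes: the complement of "none change".
p_any = 1 - p_none
# Adding the individual rates gives a close approximation of p_any.
p_any_by_sum = sum(p_change.values())
# All three change in the same TE.
p_all = prod(p_change.values())
# Exactly one changes: that marker changes while the other two stay the same.
p_exactly_one = sum(
    p_change[m] * prod(p_same[other] for other in p_same if other != m)
    for m in p_change
)

print(f"none change:          {p_none:.6f}")
print(f"any change (exact):   {p_any:.6f}")
print(f"any change (by sum):  {p_any_by_sum:.6f}")
print(f"exactly one changes:  {p_exactly_one:.6f}")
print(f"all three change:     {p_all:.3e}")
```

The exact and summed values for "any change" differ only in the fifth decimal place, which is why adding the rates works well when each rate is small.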

### Two or More Transmission Events

For 2 or more TE (opportunities to change), the probability of no change in any trial is given by the formula for multiple dependent events, approximated by a binomial distribution whose formula is

P(k) = n! / (k! * (n-k)!) * p^k * q^(n-k)

where k = number of markers and n = number of TE.
The exclamation point (!) indicates a factorial term:
2! = 2*1 = 2, 3! = 3*2*1 = 6, 4! = 4*3*2*1 = 24, etc.
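As a rough illustration of the binomial approach, the sketch below uses the standard binomial probability. It makes two simplifying assumptions that are not taken from the text: every marker in every TE counts as one chance to change (so the number of chances is markers times TE), and all markers share a single rate of 1/300. The names `n_chances` and `n_changes` are chosen here for clarity.

```python
from math import comb

def binomial_pmf(n_changes, n_chances, p):
    """Probability of exactly n_changes in n_chances independent chances."""
    # comb(n, k) is n! / (k! * (n - k)!), the factorial term in the formula.
    return comb(n_chances, n_changes) * p**n_changes * (1 - p) ** (n_chances - n_changes)

# Hypothetical case: 37 markers followed through 10 TE at a shared rate.
n_chances = 37 * 10
rate = 1 / 300

for n_changes in range(4):
    prob = binomial_pmf(n_changes, n_chances, rate)
    print(f"{n_changes} change(s): {prob:.4f}")
```

The zero-change case is the one that corresponds to an exact match.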

Here, we have to consider the odds of no change happening during any of the trials. Let p(TE=i) be the probability of no change in transmission event i, and q the probability of at least one change. From our formula q = 1 - p, p(TE=2) = 1 - q(TE=2).

For two TE:

• p(TE=1,2) = p(TE=1) * p(TE=2), but p(TE=1) = p(TE=2)
• So p(TE=1,2) = p(TE=1)^2 and q(TE=1,2) = 1 - p(TE=1)^2
• Substituting values for marker A:
• p(A0, TE=1,2) = p(A0)^2 = (1 - p(Ac))^2 = 0.996^2 ≈ 0.9920 = 99.20%

We can similarly demonstrate that

• q(TE=1,2,..i) = 1 - p(TE=1,2,..i) = 1 - p(TE=1)^i
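The power rule above extends naturally to many markers at once. The following sketch assumes a single shared mutation rate of 1/300 for every marker and treats each TE as an identical, independent repeat; both are simplifications for illustration, and the function name is hypothetical.

```python
def p_no_change(n_markers, n_te, rate=1 / 300):
    """Probability that none of n_markers changes in any of n_te TE."""
    # One TE: every marker must stay the same.
    p_same_one_te = (1 - rate) ** n_markers
    # Several TE: the one-TE probability multiplied by itself n_te times.
    return p_same_one_te ** n_te

te_counts = (1, 5, 10, 20)
print("markers", *[f"TE={n:<3}" for n in te_counts])
for n_markers in (12, 25, 37, 67):
    row = [f"{p_no_change(n_markers, n_te):.3f}" for n_te in te_counts]
    print(f"{n_markers:>7}", *row)
```

Reading across a row shows the chance of an exact match falling as TE accumulate; reading down a column shows it falling as more markers are compared.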

## Binomial Distribution

The binomial distribution formula approximates the probabilities of mutations in a ySTR haplotype as passed down through the generations.

## Gamma Distribution

Gamma distributions are used as a model to calculate TMRCA (time to most recent common ancestor) probabilities. See this page on gammas.

## Bayes Theorem

Bayesian (conditional) processes can be used to narrow your search. Bayes' Theorem holds that additional information can modify the expected probabilities:

Bayes' Rule: P(H|E) = P(E|H) * P(H) / P(E)

where H is a hypothesis (here, a given number of TE back to the CMA) and E is the additional evidence.

The Bayes process here is used to eliminate probabilities known to be impossible by virtue of documentation. The probability of impossible things is zero, p = 0. For example, assume that you know this much about the CMA for donors A & B:

• He cannot be fewer than 6 TE back in donor A's line, and
• He cannot be fewer than 7 TE back in donor B's line.

This additional genealogical information allows you to reassign the probabilities for TE <= 12 to TE >= 13. One may think of this as eliminating TE 1 to 12 from the cumulative probability graph and counting as though 1 was 13, 2 was 14, etc.
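One way to picture the reassignment is the usual conditioning step: drop the TE counts that documentation rules out, then rescale what is left so it again sums to 1. The sketch below is one reading of that idea, not the project's own calculation. The prior used here is an invented shape for illustration only, and the function name is hypothetical.

```python
def condition_on_minimum(prior, min_te):
    """Zero out TE counts below min_te and rescale the rest to sum to 1."""
    kept = {te: p for te, p in prior.items() if te >= min_te}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("documentation rules out every TE count in the prior")
    return {te: p / total for te, p in kept.items()}

# Invented prior over the total TE separating donors A and B (illustration only).
weights = {te: te * 0.85**te for te in range(1, 41)}
norm = sum(weights.values())
prior = {te: w / norm for te, w in weights.items()}

# At least 6 TE in donor A's line plus at least 7 in donor B's line.
posterior = condition_on_minimum(prior, min_te=6 + 7)

def cumulative(dist, max_te):
    """Probability that the CMA lies within max_te TE."""
    return sum(p for te, p in dist.items() if te <= max_te)

print(f"before documentation: P(TE <= 20) = {cumulative(prior, 20):.3f}")
print(f"after documentation:  P(TE <= 20) = {cumulative(posterior, 20):.3f}")
```

The probability that was sitting on the impossible TE counts is spread over the remaining ones in proportion to what they already held.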