# Match Interpretation: Theory & Math

The theory and math underlying the interpretation of Y-DNA matches are at once
simple and complex.

The theory is relatively simple; the hardest part is starting with the
right questions. The math
gets complex quickly because there are many variables to account for.

### Incomplete — Work in progress

The theory is the probability of multiple independent events (PMIE), combined
with the probability of multiple dependent events (PMDE).

- Each time Y-DNA is passed from a father to a son, each marker has an opportunity to
change value independently from what's happening with any other marker;
these are multiple independent events.
Whether a marker changes or not is a random event described by the rules of
PMIE.

- However, each transmission event is affected by the outcomes of prior
events. These are dependent events described by the rules of PMDE.

### Fundamental Questions

The hardest part of the theory is picking the right end of the problem
to begin with. What are the fundamental questions we need to ask?

- What are the chances of **no** mutation in a marker?
- What are the chances of **no** mutations in **any** of many markers?
- What is the probability that the observed match between two sets of
Y-DNA is **not** due to chance?
- What is the probability of a common male ancestor within a certain time
period?

### Caveat

There are such things as **"coincidental matches"**, which tend to occur most
often among the more common haplotypes. The patterns of marker/allele values
may be similar, but they don't spring from a common source. The similarity is
more likely due to "convergent evolution", in which organisms of different
ancestral heritage evolve to like forms.

For example, you might see that you have hundreds of reported close matches
at 37 markers. Not only are these too many to follow up, but you are probably
unrelated in genealogical time to most of them.

In such instances, SNP testing is recommended to eliminate matches of
obviously different DNA inheritance. SNP testing in some depth can tell you who
you are less related to.

### Definitions

- **Transmission event:** The passing of Y-DNA from a father to a son,
subject to uncertainty.
- **Sample space:** The set of all possible outcomes of one or more
transmission events.
- **Outcome:** Whether or not a marker changes.

Whenever we look at Y-DNA that differs, we need to compare it to the
probability of it not changing. The probability of
something **not** happening is the complement of its
happening, i.e., 1 - "happening". Statisticians use "p(x)" to
indicate the probability of an event and "q(x)" for the non-event's
probability.

The sum of the probabilities of all possible events is always 1 (or 100%).
The basic formulas are:

- p(x) = the probability of outcome x
- q(x) = the probability of all other outcomes

**p(x) + q(x) = 1, p = 1 - q, q = 1 - p**

In the project, we observe that 67/67 matches are rare, even
in matches where both donors have excellent documentation of the common male
ancestor (CMA). 37/37 matches are a little less rare, and 25/25 less rare
still. And 12/12 matches are common.

Intuitively, this suggests that the chances of Y-DNA remaining completely unchanged across many markers through multiple transmission events are small.
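A rough numerical sketch of that intuition follows. It assumes, hypothetically, that every marker has the same per-transmission mutation probability of 1/300 and that markers change independently; the function name and the rate are illustrative, not values taken from the project.

```python
def p_no_change(markers: int, transmissions: int, mu: float = 1 / 300) -> float:
    """Probability that none of `markers` changes in any of `transmissions` events.

    Assumes every marker shares the same hypothetical mutation probability `mu`
    and that all markers and transmission events are independent.
    """
    return (1.0 - mu) ** (markers * transmissions)


if __name__ == "__main__":
    # Compare the usual panel sizes over 10 transmission events.
    for markers in (12, 25, 37, 67):
        print(f"{markers:>2} markers: {p_no_change(markers, 10):.3f}")
```

Under these assumptions, the chance of a perfect match falls steadily as the panel grows, which is the pattern described above.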

Here are some tangible examples whose probabilities mirror those we want to consider.

### Buckets of balls

Picture 12 to 67 buckets, each containing 250 to 400 colored balls;
in each bucket, all the balls -- except a red one -- are the same color,
blue. We'll go through the buckets taking one ball out of each bucket
without looking. If it's not a red ball, we'll put it back; if it is red, we
replace the bucket with one holding the same number of balls, but with the
colors changed to one green and the rest red.

Then we repeat the trial, again and again. Every time we draw an "odd ball", we
swap the bucket for one with balls of different colors -- yellow, purple,
black, white, tangerine, striped, etc.

What this illustrates is that:

- The probability of each pick from a bucket (mutation) is independent of the others;
- But each trial (set of buckets) depends on the outcomes of the prior trials.

### Rolling Dice

The math is the same as that used to predict the odds of rolling dice,
where you can tell each die from the other.

Years ago, I needed to generate random numbers for sampling purposes.
To get the most random sets of numbers possible, I used a set of 10-sided dice whose sides were numbered 0 to 9.

But the dice we need here are strange;
some have 250 sides and some have as many as 400 sides. For each die, one side says
"change" and all the other sides say "same".

We'll roll 12 to 67 dice at a time, read the top side & record the results.
Then we'll roll again and again.

The math quickly gets complex for these reasons:

- **Many variables:** We are considering a minimum of 12 markers (or
dice) and up to 67, each one of which is subject to possible change.
- **Small frequencies:** The probability of no change (same) for each
marker is much greater than that of change. (Indicated by only one side of
the 250-400 being labeled "change".)
- **Multiple transmission events:** We'll be rolling the dice many
times. The statistical term for this is "trials". Each trial is dependent on
the outcomes of previous trials.
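The dice picture can be checked with a small simulation. This is only a sketch under stated assumptions: the dice sizes, the number of repetitions, and the function names are all hypothetical. It tracks only whether a "change" side ever comes up, so it does not model swapping a die after a change.

```python
import random


def simulate_no_change(sides, trials, rolls=20_000, seed=1):
    """Estimate the chance that no die ever shows "change".

    sides  -- list with the number of sides on each die (one "change" side each)
    trials -- how many times the whole set is rolled (transmission events)
    rolls  -- number of simulated repetitions (hypothetical default)
    """
    rng = random.Random(seed)
    unchanged = 0
    for _ in range(rolls):
        # A die shows "change" when side 0 comes up.
        changed = any(
            rng.randrange(n) == 0 for _ in range(trials) for n in sides
        )
        unchanged += not changed
    return unchanged / rolls


def exact_no_change(sides, trials):
    """Closed form: every die must show "same" on every roll."""
    p = 1.0
    for n in sides:
        p *= (1 - 1 / n) ** trials
    return p


if __name__ == "__main__":
    dice = [250, 300, 400] * 4  # 12 hypothetical dice
    print(simulate_no_change(dice, trials=5), exact_no_change(dice, trials=5))
```

The simulated fraction should land close to the closed-form value, which is simply the product of each die's "same" probability raised to the number of trials.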

### Simplified Example

Let's consider three markers named "A", "B" & "C".
(Three markers is simpler than our real problem.)

For not changing, we'll call them "A_{0}", "B_{0}"
& "C_{0}"; for changes, we'll call them "A_{c}",
"B_{c}" & "C_{c}". In other words,
A_{c} is the same marker as A_{0}, but changed in value.
To state it mathematically,

p(A_{0}) = q(A_{c}) = 1 - p(A_{c})

p(B_{0}) = q(B_{c}) = 1 - p(B_{c})

p(C_{0}) = q(C_{c}) = 1 - p(C_{c})

Some hypothetical values:

- Let p(A_{c}) = 1/250 = 0.004, p(B_{c}) = 1/400 = 0.0025
& p(C_{c}) = 1/300 ≈ 0.0033. (In the approximate
range of mutation frequencies.)
- With the probabilities we assigned,

p(A_{0}) = 1 - 0.0040 = 0.9960 = 99.60%,

p(B_{0}) = 1 - 0.0025 = 0.9975 = 99.75% &

p(C_{0}) = 1 - 0.0033 = 0.9967 = 99.67%.

### Basic formulas

In these formulas, "TE" stands for "transmission event".

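#### The probability of either A, B or C (any one or more) changing in one trial (transmission event) is:

p(A_{c} or B_{c} or C_{c}) ≈ p(A_{c})
+ p(B_{c}) + p(C_{c}) -- the probabilities of each
added together. (The sum slightly overcounts the rare cases in which two or
more markers change at once; with probabilities this small, the difference is
negligible.)

p(A_{c} or B_{c} or C_{c}) ≈ 0.004 +
0.0025 + 0.0033 =
0.0098 = 0.98%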

#### The probability of one, but not more than one, changing in 1 TE is:

p(A_{c} Xor B_{c} Xor C_{c}) = p(A_{c}) +
p(B_{c}) + p(C_{c}) - 2 * { p(A_{c}&B_{c})
+ p(A_{c}&C_{c}) + p(B_{c}&C_{c}) } + 3 * p(A_{c}&B_{c}&C_{c})

(We won't continue this further.)

#### The probability of all of them A, B and C changing in 1 TE is:

p(A_{c} & B_{c} & C_{c}) = p(A_{c})
* p(B_{c}) * p(C_{c}) -- the probabilities of each
multiplied together

p(A_{c} & B_{c} & C_{c}) = 0.004 *
0.0025 * 0.0033 ≈ 3.33 ∕ 100,000,000 =
3.33*10^{-8}

#### Especially important, the probability of none of them changing in 1 TE:

p(A_{0} & B_{0} & C_{0}) =
q(A_{c} or B_{c} or C_{c}) = 1 - p(A_{c} or B_{c} or C_{c})

≈ 1 - { p(A_{c})
+ p(B_{c}) + p(C_{c}) }

=
1 - {0.004 + 0.0025 + 0.0033} = 1 - 0.0098 = **0.9902 = 99.02%**
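The four one-TE quantities above can be computed directly. The sketch below assumes the three markers change independently and uses the hypothetical rates from the simplified example; the function names are illustrative. It computes "none change" as an exact product of complements, which agrees closely with the 1 - sum shortcut when the rates are this small.

```python
from math import prod

# Hypothetical per-transmission mutation probabilities from the example above.
P_CHANGE = {"A": 1 / 250, "B": 1 / 400, "C": 1 / 300}


def p_none(p):
    """No marker changes: product of the complements."""
    return prod(1 - x for x in p.values())


def p_all(p):
    """Every marker changes: product of the probabilities."""
    return prod(p.values())


def p_any(p):
    """One or more markers change: complement of 'none change'."""
    return 1 - p_none(p)


def p_exactly_one(p):
    """Exactly one marker changes while all the others stay the same."""
    total = 0.0
    for name, x in p.items():
        others = prod(1 - y for k, y in p.items() if k != name)
        total += x * others
    return total


if __name__ == "__main__":
    for fn in (p_any, p_exactly_one, p_all, p_none):
        print(f"{fn.__name__}: {fn(P_CHANGE):.6g}")
```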

### Two or More Transmission Events

For 2 or more TE (opportunities to change), the probability of no
change in any trial is
given by the formula for multiple dependent events -- approximated by a binomial distribution
whose formula is

where k = number of markers and n = number of TE.

The exclamation point (!) indicates a factorial term:

2! = 2*1 = 2,
3! = 3*2*1 = 6, 4! = 4*3*2*1 = 24, etc.
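For reference, the textbook form of the binomial probability is sketched below. Reading n as the number of trials (TE) matches the definition above. Reading k as the number of trials in which a change occurs, and p as the per-trial probability of change, are assumptions made here and may not match the author's exact usage.

```latex
P(k) = \frac{n!}{k!\,(n-k)!}\; p^{k}\,(1-p)^{\,n-k}
```

With k = 0 the factorial terms cancel and the expression reduces to (1 - p)^n, the probability of no change in n trials.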

Here, we have to
consider the odds of no change happening during any of the trials. From
our formula q = 1 - p,
p(TE=2) = 1 - q(TE=2), where p is now the probability of no change and q is
the probability of a change.

For two TE:

- q(TE=1,2) = 1 - p(TE=1,2) = 1 - [p(TE=1) * p(TE=2)], but p(TE=1) = p(TE=2)
- p(TE=1) * p(TE=2) = p(TE=1)^{2} and q(TE=1,2) = 1 - p(TE=1)^{2}
- Substituting values: p(A_{0},TE=2) = p(A_{0})^{2} = (1 - p(A_{c}))^{2} = 0.996^{2} ≈ 0.9920

We can similarly demonstrate that

- q(TE=1,2,..i) = 1 - p(TE=1,2,..i) = 1 - p(TE=1)^{i}
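The general rule for i transmission events can be sketched in a few lines. The sketch assumes markers and transmission events are independent and that each marker's rate stays constant; the function names are hypothetical.

```python
from math import prod


def p_no_change_over(p_changes, te):
    """Probability that no listed marker changes in any of `te` events.

    p_changes -- per-marker probabilities of change in a single TE
    te        -- number of transmission events (trials)
    """
    p_one_te = prod(1 - p for p in p_changes)  # no change in a single TE
    return p_one_te ** te


def q_some_change_over(p_changes, te):
    """Complement: one or more changes somewhere in `te` events."""
    return 1 - p_no_change_over(p_changes, te)


if __name__ == "__main__":
    rates = [1 / 250, 1 / 400, 1 / 300]  # hypothetical markers A, B, C
    for te in (1, 2, 5, 13):
        print(te, round(p_no_change_over(rates, te), 4))
```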

The binomial distribution formula approximates the probabilities of
mutations in a ySTR haplotype as passed down through the generations.

## Gamma Distribution

Gamma distributions are used as a model to calculate TMRCA (time to most
recent common ancestor) probabilities.
**See this page on gammas**.

Bayesian (conditional) processes can be used to narrow your search.
Bayes' theorem holds that additional information can modify the expected
probabilities:

Bayes Rule: P(A|B) = P(B|A) * P(A) / P(B), where A is the event of interest
and B is the additional information.

The Bayes process here is used to eliminate probabilities known to be
impossible by virtue of documentation. The probability of impossible things is
zero, p = 0. For example, assume that you know this much about the CMA for donors A & B:

- He cannot be fewer than 6 TE back in donor A's line, and
- He cannot be fewer than 7 TE back in donor B's line.

This additional genealogical information allows you to reassign the probabilities for TE <= 12
to TE >= 13, since 6 + 7 = 13. One may think of this as eliminating TE 1 to 12 from the cumulative
probability graph and counting as though 1 was 13, 2 was 14, etc.
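A minimal sketch of the elimination step follows, read here as standard conditioning: drop the impossible TE counts and rescale what remains so it sums to 1. This reading is an assumption about how the reassignment is done. The shift described above (counting 1 as 13) is a different shortcut and is not what this code does. The flat prior and the function name are hypothetical.

```python
def condition_on_minimum(prior, min_te):
    """Zero out impossible TE counts and renormalise the rest.

    prior  -- dict mapping TE count -> probability that the CMA lies there
    min_te -- smallest TE count still possible given the documentation
    """
    kept = {te: p for te, p in prior.items() if te >= min_te}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("documentation rules out every TE in the prior")
    return {te: p / total for te, p in kept.items()}


if __name__ == "__main__":
    # Hypothetical flat prior over 1..20 TE, purely for illustration.
    prior = {te: 1 / 20 for te in range(1, 21)}
    posterior = condition_on_minimum(prior, min_te=13)  # 6 + 7 = 13
    print(posterior)
```

In practice the prior would come from the TMRCA model rather than a flat table; the conditioning step itself would stay the same.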