YDNA PROJECTS: Project size vs.
Unmatched Participants
– an exploration for project administrators
by Ralph Taylor
It's sometimes said: "For any YDNA project, the number of unmatched
participants will go down as the size of the project increases." A more
sophisticated statement, perhaps, would be "... the proportion of
unmatched participants will go down as the size of the project increases."
Either version seems intuitive, but neither may be universally true, and their
applicability is of concern to project administrators.
How
does project size relate to the chances of new participants matching or not matching existing
participants?
Hardly
any historical or scientific research attention has been paid to the subject;
there is little definitive or authoritative guidance; we project administrators
are on our own. It is, however, a question worth examining more objectively.
{Note: Throughout, there are superscript numbers, in blue & underlined, like
this[00]. They indicate end notes for sources, explanations, etc. Each is a link
to its separate note. Due to reorganization of the material, the order of the
notes does not necessarily reflect their place in the document.}
We'd like to think that each new participant without a match puts us a step
closer to everyone matching; but does it? And if so, how big is that step? Our
concern, I think, relates to unmatched participants representing unrealized
promises from YDNA testing: that it will aid them in their family history
research. We’d like to see all our participants find a match and identify their
CMAs (common male ancestors).
I've
given some thought to the questions, on behalf of our ~60% of Taylor project
participants without matches. This writing will attempt to present a way of
looking at the situation which may help project administrators to arrive at
answers.
Note: As of July 2015, the percentage of project members
without matches in the project has declined to ~46%.
Project variety
The variety of YDNA projects is large; considering there are more than 7,000
of them, it isn't surprising. There are surname projects, geographical area
projects and haplogroup projects (which are not addressed here). Within surname
projects, too, there is a wide variety; some concern rare names of single-point
origin; others focus on common names of multiple origins. Some projects (e.g.,
for Scots clans) include multiple surnames.
We are looking for a model which is useful to administrators of projects across
as broad a range of size and focus as possible.
To begin:
Let's start our examination with two concepts:
 The number of direct ancestral paternal lines[1] (AKA "patrilines") for the
surname, region, etc. – the "population". (Think of the total number of common
ancestors at some past time.) Let's designate this number by the letter "A", to
represent the ancestors who gave rise to all us descendants.
and
 The degree to which the sample (project participants) represents the variety
of ancestral lines within the project's target population. Let’s call this
represented array “Found lines” and designate it by the letter “F”.
The Euler circle diagram below illustrates the concepts in combination. Found
lines are a part of – but not all of – the patrilines for the project’s
target population.
Conceptual Diagram

In this diagram, the Found Circle is small in relation to the
Ancestral circle. The number of Ancestral lines greatly exceeds Found lines. 
The ratio of Found Lines to Ancestral Lines (F/A or F:A) is important; it
 Establishes the probability that a randomly-selected new participant will
match one or more existing participants, and
 Indicates the degree to which the project has surveyed the target
population's YDNA.
The remainder of this exploration continues with a more detailed examination.
FOUND LINES (F):
We'll start with the simpler-to-determine "Found lines". Here, we depart from
other approaches. What we mean is NOT the total participation in the
project. Nor is it “penetration” relative to population as described by James
Irvine. In contrast to ancestral lines, the number of found lines is easy to
determine within the project.
“Found lines” is a reasonably simple calculation: the number of unmatched
participants plus the number of DNA-matched groups or "clusters" of
participants. We can call this number "F".
Each singleton represents a line, and each group[20] or cluster (all matching
participants in the group taken together) represents one line.
Note: Surname projects must make adjustments for
 NPE – Some (perhaps 20%-30%) of singletons will be the result of
"not the parent expected" events. Depending on the age of the event, they
may not represent ancient patrilines.
 Non-surname paternity – Some who know their patriline is of
another surname may join the project, e.g., for mtDNA, autosomal DNA
(indirect relation) or other reasons.
The S component in the formula below should be
reduced to account for these.
If we let S stand for the number of unmatched
singletons[27]
and G stand for the number of matched groups or clusters,
then:
F = S + G.
Found lines (F) is a number which project administrators may influence over
time. Ancestral patrilines (A), to be discussed below, is beyond project control; it
is what it is, based on past events, and cannot be affected by project
administrators.
Possible relationships between S & G are beyond this present exploration's scope. However,
a discussion of them may be found here.
The larger
F is in relation to A (F /A ratio)
the more likely a new participant will match another in the project.
Sampling problems:
To face
reality, our projects’ samples are not randomized and not necessarily
representative.
 The USA is
greatly overrepresented and regions of origin are greatly underrepresented.
 Project participants are older than the general population.
 More affluent people are overrepresented compared to less affluent – a social class bias.
 There is some incentive for those with paper trails indicating a common male
ancestor to test in order to
confirm the paper.
 Some projects proactively recruit participants with known pedigrees.
 Countervailing these is an incentive for those with poor paper trails (or NPE)
to use testing to focus further research or as a substitute for documentary
research.
 Many social,
cultural and emotional factors influence individual decisions to test or not to
test.
The
sampling problems are perhaps insurmountable. We may not be able to achieve true
random or representative sampling. In the interest of getting to an
approximate probability model, however, we'll leave it to others to assess how
this reality affects our sampling and proceed as best we can.
It should
be noted that, for most projects, A > F until all ancestral lines are found.
Every new participant will either match one or more existing participants or add
to the number of found lines.
Example:
The Taylor project has identified ~240 separate & unique found lines.
 S=201, G=39 [21]
 F = S + G = 201 + 39 = 240
Update, July 2015
 S=299, G=87
 F = (S - NTP)*(1 - 0.3) + G = (299 - 75)*(0.7) + 87 = ~244
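The found-lines arithmetic in both examples can be sketched in a few lines of Python. The function and parameter names are assumptions made for this illustration: `non_surname` stands for the NTP count subtracted in the July 2015 update, and `npe_share` for the fraction of singletons discounted as NPE (0.3 in that update).

```python
def found_lines(singletons, groups, non_surname=0, npe_share=0.0):
    """Estimate found lines, F = S + G, optionally reducing S first.

    non_surname and npe_share are illustrative names for the two
    adjustments described in the note on surname projects.
    """
    adjusted_singletons = (singletons - non_surname) * (1.0 - npe_share)
    return adjusted_singletons + groups


# Original example: S=201, G=39, no adjustments applied.
f_original = found_lines(201, 39)

# July 2015 update: S=299, G=87, 75 subtracted, 30% NPE discount.
f_update = found_lines(299, 87, non_surname=75, npe_share=0.3)

print(round(f_original), round(f_update))
```

With both adjustments left at their defaults the function reduces to the plain F = S + G.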
ESTIMATING MATCH (& UNMATCH) PROBABILITY – THE F/A RATIO:
The likelihood of a new participant matching any other participant or participants
is the ratio of found lines to ancestral patrilines (F:A
or F/A), which establishes the expected probability. The
closer this number is to 1, the greater the probability of matching.
Pr(match) = F/A.
More on this follows the discussion of ancestral lines (A).
What we mean by “ancestral lines” (A) is that number of paternal ancestors
which cannot be reduced further by finding intersections of lines of descent
(“nodes” or common ancestors) within a genealogical time frame. This term is
roughly congruous with “haplotypes” [2].
Imagine that one could identify every original ancestor for the target
population who has living direct filial male descendants.
 For an area
project, A would be the number of adult males with modern descendants at the
later of the time of settlement of the area or beginning of the genealogical
time frame.
 For a surname
project, this would be the number of male ancestors at the beginning of the
genealogical time frame whose modern descendants bear the surname.
One way
to picture the concept is by comparison with the “Founder Effect”, except that
these founders do not necessarily leave the larger population in which they’re
embedded.
Another way we might picture this is by reversing our usual perspective: Imagine
pedigrees – back to the beginning of genealogical time – for every
person in a project's subject group. The nodes, where pedigrees intersect each
other, are Most Recent Common Ancestors (MRCA). But some pedigrees may not
intersect others at all; they represent independent lines of descendancy. This
will be especially true for occupational surnames and some regional projects.
The
number of ancestral lines for any particular surname or region has received
very little research attention. I'd argue, however, that it is vitally important
to YDNA project administrators, in order that they understand the nature of the
task before them. It sets out the size of the task.
Axioms of YDNA projects
All YDNA projects share the same axioms:
 That Y-chromosome
haplotypes point to a unique individual ancestor, the “founder” of each YDNA
haplotype (ancestral patriline).
 That YDNA is
transmitted from father to son almost without change. [29]
 That YDNA remains relatively stable over many successive transmission
events, through many fathers to sons, grandsons, great-grandsons, etc.
 That surnames in many cultures are also passed down father to son,
without appreciable change. [30]
From these axioms come corollaries:
 That, at the
beginning of a genealogical time frame
[3], there
existed a finite number of common male ancestors founding each unique YDNA
pattern.
 That this number
is – at least sometimes – knowable or subject to estimation.
The number of ancestral lines, A, may be small or large – ranging
from a handful to thousands – and it can be difficult to estimate. Presumably,
the more common & multi-originated the surname, the more lines will exist in the
population and have existed in the past. In some cases, the number may be
unknowable to an acceptable precision; however, it is always finite, greater
than zero and its true value is independent of a project’s DNA results or
sampling.
Single ancestral
patriline

Three ancestral patrilines 
In the
diagrams above, the red circles indicate ancestors; the green circles represent
living descendants for YDNA testing.
Factors affecting A, the number of ancestral
patrilines:
The
number of lines which may have existed at some past time is not necessarily the
number with living descendants. The number may have decreased or increased since
then:
 Non-parental
events (NPEs) – adoptions, name changes, etc. – will both add to and subtract
from this number. Whether the additions & subtractions cancel each other or are
negligible in overall effect probably varies with the surname or region.
 Some paternal
lines will have died out
[4]
during the Genealogical Time Frame. They will have no living descendants for testing.
A generation may have died without issue or have had only daughters, thus
breaking the direct paternal descendancy.
 A reviewer notes that there may be a significant difference between the
number of living descendants per line in the United States and in the country of
origin. For her comment, see the end note.
[28]
How many ancestral patrilines?
Depending on the nature of the project, A may range from one (single origin)
through a handful (plural origin) to thousands (multi-origin).
Large geographical projects of recent population growth (e.g., “Southern
California”) or common surnames of multi-point origin (e.g., “Smith”) will have
a greater A than rare-surname projects. As we shall see
later, this has significant implications for project administration.
METHODS FOR DETERMINING
ANCESTRAL LINES:
Project
administrators will find little useful guidance for estimating the number of
ancestral patrilines in genetic or genealogy literature. The subject has, seemingly,
been considered unworthy of serious attention.
We will
describe five separate approaches for estimating A and give an example for each:
 Population Growth model;
 Surname Frequency model;
 Historical analysis;
 Tree model and
 A posteriori use
of project findings.
Some of these
methods may be most useful for surname projects; some for geographical projects. All require many simplifying
assumptions and are subject to error.
[34]
Results of multiple approaches may be assessed to suggest the most reasonable estimate.
The primary purpose is to illustrate that such estimates are possible and to
start discussion about how they may be done.
The population growth approach consists of regressing growth back to an earlier (&
typically lower) population. One starts from the assumption that population growth
(or decline) for a particular surname or region matches that of the population
as a whole. This allows one to work from an estimated number of patrilines at one
point in time to an estimate for an earlier point in time by means of compound
growth formulas.
Here are
two graphs of the estimated British population from 410 (after the Roman era)
through 1999.
[5]
The left
side shows population in millions with percent change. (The Black Death change is off the scale.) The right side shows the logarithmic values for
population.
Estimated Population & Percent Change 
Estimated Population (Logarithmic) 
Note that the logarithmic curve from after the Black Death[6]
to modern times is very roughly straight, with a positive slope. Annual
population growth in this 650-year period ranged from a low of 0.36% to
a high of 2.15%, with a mean of 0.91% & standard deviation of 0.72%.
The mathematics resemble a compound interest problem. Let A_{G} = number of
households with the surname at an earlier date, D_{1}, and H = number of
households at a later date, D_{2}, with annual growth rate G; then,
H ≈ A_{G} * (1+G)^(D_{2} – D_{1}); therefore,
A_{G} ≈ H / (1+G)^(D_{2} – D_{1})
Simplifying assumptions:
 That each
household contains one and no more than one CMA, the father;
 That number of
persons per household remains relatively constant over the period of
examination;
Example:
Let us start with the 22,287 Taylor households reported[7]
in the 1851 English census and work backwards to an estimate for 1355[8].
A_{G} ≈ 22,287 / (1.0091)^(1851 - 1355) = 22,287 / 1.0091^(496)
= 22,287 / 89.406
A_{G} ≈ 249, with an upper 90% confidence bound of 693[9].
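The regression can be checked with a short Python sketch. The function name and its argument names are illustrative only; the inputs are the household count, mean growth rate and dates used in the example.

```python
def earlier_households(later_households, annual_growth, later_year, earlier_year):
    """Regress a household count back in time at a constant compound rate.

    Implements A_G = H / (1 + G)^(D2 - D1).
    """
    years = later_year - earlier_year
    return later_households / (1.0 + annual_growth) ** years


# 22,287 Taylor households in 1851, 0.91% mean annual growth, back to 1355.
a_g = earlier_households(22287, 0.0091, 1851, 1355)
print(f"A_G is roughly {a_g:.0f}")
```

Because the growth factor is raised to the 496th power, small changes in the assumed rate move the result a great deal, which is one reason the confidence bound is so wide.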
Comments on example:
 It has been ~660 years since 1355; for 249 Taylors then to produce the
present ~1,560,000 would require a constant, compounded growth rate of 1.37% per
year. This makes no allowance for any lines dying out during >6 centuries.
 A_{G} ≈ 249 is just slightly more than the number of “found
lines” (discussed above) in the project. The “acid test” of the validity
of this estimate will be if the project fails to add more singletons as it grows
in size. [31]
Another Example:
For another example based on the population growth method see
"How many Whitneys?"
by Tim Doyle.
The frequency [10] of a surname in modern times may be a rough indicator of its
number of lines relative to other surnames. Greater or lesser frequency of a
surname in recent times may suggest its relative prevalence in earlier times.[11]
 Smith is the most common surname in the US & UK, carried by ~1% of people
(881 per 100,000 in the US), and it probably has more lines than Atkins, the
500th (0.03%), or Archer, the 1019th (~0.02%).
 Taylor[12] ranks 13th in the US, ~0.27% or ~720,000 people, and about 4th in
England (~306,000).
 Bradbury (for my friend, Brent:) ranks 4137th, ~0.003%, & Bradberry 7427th,
~0.002%, in the US. UK = Bradbury: 519th, 14,000 people & Bradberry: 192nd,
~22,000.
 Irvine (& variants: Irwin, Erwin, Irvin, Irving, McIrvin & Irvan) ranks from
939th to 51227th, totaling about 0.04% of the 2000 US population. It has been
described as a surname with a small A.
The 20 most common surnames in the United States. [13]

| Rank | Name | Count | Per 100k |
|---|---|---|---|
| 1 | SMITH | 2,376,206 | 880.85 |
| 2 | JOHNSON | 1,857,160 | 688.44 |
| 3 | WILLIAMS | 1,534,042 | 568.66 |
| 4 | BROWN | 1,380,145 | 511.62 |
| 5 | JONES | 1,362,755 | 505.17 |
| 6 | MILLER | 1,127,803 | 418.07 |
| 7 | DAVIS | 1,072,335 | 397.51 |
| 8 | GARCIA | 858,289 | 318.17 |
| 9 | RODRIGUEZ | 804,240 | 298.13 |
| 10 | WILSON | 783,051 | 290.27 |
| 11 | MARTINEZ | 775,072 | 287.32 |
| 12 | ANDERSON | 762,394 | 282.62 |
| 13 | TAYLOR | 720,370 | 267.04 |
| 14 | THOMAS | 710,696 | 263.45 |
| 15 | HERNANDEZ | 706,372 | 261.85 |
| 16 | MOORE | 698,671 | 259.00 |
| 17 | MARTIN | 672,711 | 249.37 |
| 18 | JACKSON | 666,125 | 246.93 |
| 19 | THOMPSON | 644,368 | 238.87 |
| 20 | WHITE | 639,515 | 237.07 |


The 20 most frequent surnames[14] in the United Kingdom:

| Rank | Surname | Count | Per 100k |
|---|---|---|---|
| 1 | SMITH | 652,563 | 111.00 |
| 2 | JONES | 538,874 | 91.66 |
| 3 | WILLIAMS | 380,379 | 64.70 |
| 4 | TAYLOR | 306,296 | 52.10 |
| 5 | BROWN | 291,872 | 49.65 |
| 6 | DAVIES | 279,647 | 47.57 |
| 7 | EVANS | 222,580 | 37.86 |
| 8 | THOMAS | 202,773 | 34.49 |
| 9 | WILSON | 201,224 | 34.23 |
| 10 | JOHNSON | 193,260 | 32.87 |
| 11 | ROBERTS | 187,871 | 31.96 |
| 12 | ROBINSON | 165,193 | 28.10 |
| 13 | THOMPSON | 162,920 | 27.71 |
| 14 | WRIGHT | 161,391 | 27.45 |
| 15 | WALKER | 155,734 | 26.49 |
| 16 | WHITE | 154,147 | 26.22 |
| 17 | EDWARDS | 153,284 | 26.07 |
| 18 | HUGHES | 151,269 | 25.73 |
| 19 | GREEN | 145,856 | 24.81 |
| 20 | HALL | 145,231 | 24.70 |

If it is assumed that frequency in the present population is indicative of
frequency in a distant past, the present frequency may be multiplied by the
total population in the past to estimate the number of individuals at that time.
One may “daisy-chain” an estimate of household size to calculate A at the time
of interest. [32]
Let Fr = frequency in persons per 100,000 in modern times, T = target
population at the time of interest, and P = total population at time of
interest; then
T = P * (Fr / 10^5)
And for the daisy-chain, let H_{P} = number of persons per household; then:
A_{F} ≈ T / H_{P}
≈ P * (Fr / 10^5) / H_{P}
≈ P * Fr / (H_{P} * 10^5)
Simplifying assumptions:
 That frequency of
a surname in the population is the same at the time of interest as for the time
for which data is available. This assumption may not hold if an intervening
period of high immigration has been experienced.
 That persons per
household is the same at the time of interest as for the time for which data is
available. This assumption may not hold if childbearing practices have changed
significantly.
 That each
household contains the father. This assumption may not hold immediately after
widespread wars.
 That household fathers do not share a common male ancestor. This may not
hold for certain surnames.
Example:
52.1 per 100,000 British subjects bore the Taylor surname in 2001. The population of
Britain in 2001 was ~58,790,000 (587.9*10^5) and in the late 14th
century about 2,250,000 (22.5*10^5). Assuming H_{P} ≈ 5.0, then
T ≈ (2,250,000) * 52.1 / 100,000 ≈ 1,172
A_{F} ≈ 1,172 / 5.0 ≈ 234.
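The daisy-chain can be written as a small Python function. This is only a sketch: the function and argument names are made up for illustration, and the inputs are simply the figures stated in the example.

```python
def lines_from_frequency(total_population, per_100k, persons_per_household):
    """Estimate A_F = P * Fr / (H_P * 10^5).

    total_population      -- P, population at the time of interest
    per_100k              -- Fr, modern surname frequency per 100,000
    persons_per_household -- H_P, assumed household size
    """
    target_population = total_population * per_100k / 100_000
    return target_population / persons_per_household


# Late 14th-century population of 2,250,000, Fr = 52.1, H_P = 5.0.
a_f = lines_from_frequency(2_250_000, 52.1, 5.0)
print(f"A_F is roughly {a_f:.0f}")
```

The estimate scales linearly with each input, so a doubling of the assumed frequency or a halving of household size doubles A_F.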
Comments on example:
 A_{F} ≈ 234 is less than the number of “found
lines” (discussed above) in the project, raising doubts as to the validity
of the estimate.
 For 234 Taylors in 1355 to have produced the present ~1,560,000 would
require a constant, compounded growth rate of 1.38%. This makes no allowance for
any lines dying out.
History may be a better guide to the number of ancestral lines, particularly
for occupational surnames. However, the method requires research and analysis
to be detailed and fitted to each target population’s history. Techniques and
data needed will be different for each project.
For geographical area projects, historical analysis may be the only relevant
method. Data may be readily available to establish the ancestral lines of
interest.
 A_{H} ≈ Σ(CMA) + NPE_{i} – NPE_{e} – Ended
 Where CMA = ancestors; NPE_{i} = NPEs in; NPE_{e} = NPEs out; Ended = lines
with no current descendants.
Simplifying assumptions will vary with the project’s particular target population. For English-origin
surnames, they include:
 That historical
data from the period close to surname adoption can be found to estimate A.
 That only one
family of each surname was allowed per town.
 That families of the same surname do not share a common male ancestor.
Example:
As a “rough cut”, we might estimate – for the Taylor surname – that there was
at least one family with this occupation in each mid-sized or larger English
town during the mid-14th century. Assuming that only one family of each surname
was allowed per town, this gives us a rough estimate for A somewhere between
1,000 and 3,000. However, there are historical indications that some tailors
were itinerant, moving from town to town and setting up tents for their places
of business.
To refine that estimate: About 2,400 English towns had markets &/or fairs by 1516.
(Source: “Gazetteer of Markets and Fairs in England and Wales to 1516”[15].) Some of these towns would have had
local tailors; others would have itinerant tailors traveling from fair to fair.
We do not have an estimate for how many of these paternal lines may have
died out and ceased to have direct male descendants in the present. [16]
Adjusting year 1516 to year 1355:
A random sample[17] of 33 places the Gazetteer names in England & Wales found
only two places had earliest recorded dates later than 1350 (i.e., 1358 & 1425) and both were
“prescriptive”, suggesting long custom. The mean date of grant for the sample
was 1252 with a standard deviation of 62 years. We take this to mean there was
at most a 3% growth in the number of mid-sized to large English & Welsh towns
from the mid-13th century to the early 16th.
Let T_{all} = total number of towns large enough for markets &/or fairs
T_{all} = 2,400 * 0.97 = 2,328
Adjusting for itinerants:
Assume that the 25% smallest of T_{all} were served by itinerant tailors and
that each itinerant would serve an average of four such towns[18]. Let T_{res}
= towns with resident tailors, T_{itin} = towns with itinerant tailors,
I = number of itinerants and A_{H} = total number of tailors who took the Taylor surname:
T_{res} = T_{all} * 0.75 = 2,328 * 0.75 = 1,746;
T_{itin} = T_{all} * 0.25 = 2,328 * 0.25 = 582;
I = 582 / 4 ≈ 146
A_{H} ≈ T_{res} + I = 1,746 + 146 ≈ 1,892.
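The town-count adjustments can be strung together in Python as follows. This is a sketch under the example's own assumptions (3% growth, 25% of towns served by itinerants, four towns per itinerant); the function and parameter names are hypothetical. Percentages are passed as whole numbers so the arithmetic stays in integers, and the number of itinerants is rounded up, which matches the rounding of 145.5 to 146 in the example.

```python
import math


def historical_lines(towns_1516, growth_pct, itinerant_pct, towns_per_itinerant):
    """Towns with resident tailors plus the itinerants serving the rest."""
    # Scale the 1516 town count back to the earlier date.
    towns_all = towns_1516 * (100 - growth_pct) // 100
    # Split towns between itinerant-served and resident-served.
    towns_itinerant = towns_all * itinerant_pct // 100
    towns_resident = towns_all - towns_itinerant
    # Each itinerant covers several towns; round up to whole tailors.
    itinerants = math.ceil(towns_itinerant / towns_per_itinerant)
    return towns_resident + itinerants


a_h = historical_lines(2400, 3, 25, 4)
print(a_h)
```

Varying `itinerant_pct` and `towns_per_itinerant` gives a quick feel for how sensitive A_{H} is to the itinerant-tailor assumptions.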
Another Example:
For another example based on historical – or, more accurately,
prosopographic – methods, see
"Gifford Origins" by Paul Gifford.
With ingenuity & persistence, similar historical analyses may help develop a priori
estimates for shedding light on A_{H}
for other surnames of other origins.
In some projects, it may be possible to obtain or develop for each participant
a pedigree or “tree” of sufficient quality to identify all of the original
common male ancestors (CMA). This may be more feasible for projects with small
target populations than those with large target populations.
This
leads to:
A_{T}= Σ(CMA)
Two approaches are possible:
 Bottom-up – Participants submit pedigrees, possibly with sorting
and collating by project administration, and
 Top-down – Pedigrees are developed by one person or team through research into all recorded instances of the surname.
(See Guild of One-Name Studies.) This is an arduous process, feasible, perhaps,
only for less-common surnames.
Simplifying assumptions:
 That all common
male ancestors of existing lines have been identified in the trees;
 That the trees are credible and meet genealogical standards.
Examples:
The
Taylor project is not amenable to this approach at this time; an
insufficient number of ancestral trees of adequate quality have been collected.
However, it has been used successfully:
 The Clan Irwin project has used the tree approach in
combination with DNA to identify about 41 ancestral lines to date. A
description of the project is
here.
 The Cruwys/Cruse project has used this method, described
here.
The
methods described above for determining the number of ancestral lines (A) are “a
priori”; they are independent of the findings or procedures of the project. But
a priori is not the only way available. A separate, alternative method (“a
posteriori”) uses the project’s observed match rate to estimate the number of
lines for the subject population.
Let “Y” represent the number of participants with YDNA STR results and “M” the
number of participants with at least one match. The “match rate”, R, equals M/Y and
A_{P} ≈ Y/R = Y/(M/Y) = (Y^2)/M
i.e., project size divided by match rate or, equivalently, the square of project
size divided by the number of matched participants.
Simplifying assumptions:
 That project size is sufficiently large to capture an adequate number of ancestral lines;
 That a project’s YDNA results representatively sample the existing ancestral lines without undue
bias to particular sets of lines;
Example:
The Taylor Family Genes project had 318 participants with YDNA results (as of 30
May 2010), of whom 122 had at least one match. This yields an estimate of
A_{P} = (318^2)/122 = (101,124)/122 = ~829.
Update July 2015: A_{P} = (651 YSTR - 75 NTP)^2/352
= (576)^2/352 = ~942
Comment on example: A = ~829 is a much lower estimate than the A = ~1,892
derived by the historical method. The F/A ratio would become
~0.290, instead of ~0.127. It is, however, much higher than the estimates of A_{G} =
~249 and A_{F} = ~234.
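A minimal Python sketch of the a posteriori calculation follows. The function name is an assumption for illustration; the inputs are the participant and match counts given for May 2010 and July 2015.

```python
def a_posteriori_lines(tested, matched):
    """Estimate A_P = Y / R, where R = M / Y is the observed match rate."""
    match_rate = matched / tested
    return tested / match_rate


# May 2010: 318 participants with results, 122 with at least one match.
a_2010 = a_posteriori_lines(318, 122)

# July 2015: 651 with Y-STR results less 75 NTP, 352 with a match.
a_2015 = a_posteriori_lines(651 - 75, 352)

print(f"{a_2010:.1f} {a_2015:.1f}")
```

Writing the estimate as Y / R makes the circularity plain: the only inputs are the project's own size and match rate.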
Summary:
This
method is more heavily influenced by project departures from random sampling
procedures than the others described here. It is also influenced by the match rules[19]
of the project. It is also circular, in that a project’s past findings become a
determinant of expected findings. (In other words, “It is so because it is so.”)
The a
posteriori method may, however, be useful in establishing a “ballpark” for an
ancestral lines estimate and it relies on no data other than the project's
results.
Summary of Methods for Estimating A:
We have
described five methods for estimating the number of ancestral lines.
 Population Growth: A_{G} ≈ H / (1+G)^(D_{2} – D_{1})
 Population Frequency: A_{F} ≈ P * Fr / (H_{P} * 10^5)
 Historical: A_{H} depends on historical circumstances at the name’s origin.
 Tree: A_{T} = Σ(CMA)
 A Posteriori: A_{P} ≈ (Y^2)/M
It is,
perhaps, not an exhaustive list. Others may envision additional & more precise
approaches and methods.
Examples:
For our Taylor examples, we derived the following estimates:
 Growth:
A_{G} ≈ 249, with upper 90% CL ~693;
 Frequency: A_{F}
≈ 234;
 Historical:
A_{H} ≈ 1,892;
 A Posteriori: A_{P}
≈ 829.
Comments on examples:
These example estimates display a large spread,
which, it has been suggested, jeopardizes the practicality of the logic. Alternative
views are that
 Different methods should be expected to produce different results; very
different methods may produce very different results;
 A large spread may indicate a "true" estimate is somewhere between
low and high;
 A large spread demonstrates the difficulty of the problem;
 Evaluation of an estimate's credibility should be withheld until the F/A
ratio is assessed.
Estimation errors & confidence limits:
Where
would a statistical examination be without some discussion of errors and
confidence? All measurements or estimates are subject to error and the size of
the errors, relative to the estimate, determines how much confidence to place in
it.
In
compound estimates, such as these, errors may easily compound upon
one another rather than cancel out, thus greatly enlarging the range for
confidence levels. Further,
for some methods, the builtin error may be indeterminable  at least partly due
to the simplifying assumptions. It may not be
possible to construct a confidence bounds estimate based on the underlying error
estimates.
The Taylor estimate examples above have a very wide range; the highest estimate is
almost an order of magnitude (10 times) greater than the lowest. The mean of the estimates
is roughly 800 with a standard deviation of 778. Based on this information alone,
the 90% confidence bounds would be
1 < A < 2,595.
To review "lines found", the larger
F is in relation to A (F /A ratio)
the more likely a new participant will match another in the project.
If we let S stand for the number of unmatched
singletons and G stand for the number of matched groups or clusters,
then:
F = S + G.
ESTIMATING MATCH (& NON-MATCH) PROBABILITY – THE MATH:
The likelihood of a new participant matching any other participant or participants
is the ratio of found lines to ancestral lines (F:A
or F/A), which establishes the expected probability. The
closer this number is to unity (1), the greater the probability of matching:
Pr(match) = F/A, Pr(non-match) = 1 - F/A
Mathematically, this is a "sampling without replacement" problem; a participant
once tested does not re-enter the untested pool. Each new participant, if selected randomly, has a
statistical chance of matching another given by the fraction F/A and (by not matching) of adding to the
number of lines sampled given by 1 - F/A.
Assuming A > 10*F ("infinite population"), [22]
 Variance is given by Var(F/A) = [(F/A)*(1 - F/A)/F] and
 Standard error of the estimate by SE(F/A) = sqrt[(F/A)*(1 - F/A)/F].
 The product (F/A)*(1 - F/A), which appears in both functions, reaches its
maximum of 0.25 when F/A = 0.5;
 Its minima are at F/A=0 (i.e., F=0) & at F/A=1 (i.e., F=A).
As F → A, F/A → 1 and the theoretical probability of a new participant matching an
existing participant → 100%. However, even as F/A approaches 1, the likelihood
of matching will remain below 100%. (One reason for this is that some
participants will represent the last of their lines and no one else living will
be available for testing.)
Let's
assume some numbers and see where they take us:
 F=50, A=100: F/A = 0.5 = 50%; one new participant is equally likely to match
another – thus joining a group or helping to form a new group – as to not match
any existing participant and thus add to the number of
found lines. (The 90% confidence level is ~+/-8%.)
 F=50, A=1000: F/A = 0.05 = 5%; one new participant has about a 5% chance of
matching and a 95% chance of not matching. (The 90% CL is ~+/-1%.)
 F=750, A=1000: F/A = 0.75 = 75%; one new participant has about a 75% chance of
matching and a 25% chance of not matching. (The 90% CL is ~+/-2%.)
These
hypothetical examples demonstrate the profound effect of the F/A ratio on probabilities of
matching:
A project with large A requires a proportionately large F to attain a
similar matching probability to one with a smaller A.
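The three hypothetical cases can be recomputed with a short Python sketch. The margins use a normal approximation (z = 1.645 for 90%) with A as the sample size; that choice is an assumption of this sketch, made because it reproduces margins of about 8%, 1% and 2%, whereas the SE formula as given divides by F. All names are illustrative.

```python
import math

Z_90 = 1.645  # two-sided 90% quantile of the standard normal


def match_probability(found, ancestral):
    """Pr(match) = F / A."""
    return found / ancestral


def margin_90(found, ancestral):
    """Half-width of a 90% interval, dividing by A (an assumption)."""
    p = found / ancestral
    return Z_90 * math.sqrt(p * (1.0 - p) / ancestral)


for f, a in [(50, 100), (50, 1000), (750, 1000)]:
    p = match_probability(f, a)
    print(f"F={f}, A={a}: Pr(match)={p:.2f}, +/-{margin_90(f, a):.3f}")
```

Swapping the divisor inside `margin_90` from `ancestral` to `found` shows how much the stated margin depends on which quantity is treated as the sample size.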
Taylor example:
The 240 found lines yield F/A ratios as follows:
 F/A_{T} – undetermined, insufficient data
 F/A_{H} = 240 / ~1,892 = 12.7%
 F/A_{P} = 240 / ~829 = 29.0%
 F/A_{G} = 240 / ~249 = 96.4%
 F/A_{F} = 240 / ~234 = 102.6%
Comment on example: From the “inside”, it seems difficult to believe that this
project, with such small penetration[23], has found as many as 90% of the
ancestral lines that exist in the target population.
Participants represent about 44 per 100,000 of those enumerated with the surname
in the 2000 US federal census and about 30 per 100,000 of the most recent US &
UK censuses combined. Nor does it seem reasonable that more than one million
Taylor-surnamed people have only 240 common paternal ancestors after less than
700 years.
F/A ratio as an indicator:
We might
expect the ratio of F to A to vary from project to project and within a project
during its life. Suggested interpretations for various values include:
 F << A (F/A << 1) –
Typical of a beginning project. Only a few of the ancestral lines have been
tested.
 F < A (F/A < 1) –
Typical of most projects. Some ancestral lines have yet to be found.
 F=A (F/A=1) –
The project’s “survey” phase is essentially complete. Most (or all) ancestral
lines have been tested.
 F > A (F/A > 1) – Possibilities include:
 NPEs have added to the ancestral lines, or
 The estimate for A is not valid.
 F >> A (F/A >> 1) – Possibilities include:
 The estimate for A is not valid, or
 Match rules should be reviewed.
The binomial distribution is useful for estimating the probabilities of various
numbers of matches in a group of new members.
The
problem of how many matches to expect in a sample of new participants fits a
"Bernoulli Trial" model; a match is a "success" and each new participant is a
"trial". We can use the binomial probability function to tell us the
probability, "Pr(k=x)", that we will get exactly "k" successes (matches) in "n"
trials (new participants).[24]
Pr(k) = n!/[k!*(n-k)!] * (F/A)^k * (1-F/A)^(n-k) [25]
For the
number of successes, k, in n trials:
 The expected value is given by the formula E[X] = np = n * F/A, and
 The variance by Var[X] = np(1-p) = n * F/A * (1 - F/A).
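A brief Python sketch of these two formulas follows. The function names are ours, and the group size of 10 and the two F/A values are illustrative choices. The standard deviation, the square root of the variance, is added because it is in the same units as the number of matches.

```python
from math import sqrt

def expected_matches(n, p):
    """E[X] = n * p, where p = F/A."""
    return n * p

def match_variance(n, p):
    """Var[X] = n * p * (1 - p)."""
    return n * p * (1 - p)

n = 10  # size of the group of new participants (illustrative)
for p in (0.05, 0.5):
    mean = expected_matches(n, p)
    sd = sqrt(match_variance(n, p))
    print(f"F/A = {p}: expect {mean:.2f} matches, standard deviation {sd:.2f}")
```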
So, if F/A = 0.5, the probability of exactly 3 matches for 10 new participants is
 Pr(k=3) = 10!/[3!*(10-3)!] * 0.5^3 * (1-0.5)^(10-3) = 120 * (~0.000977) = ~0.117 = ~11.7%

Similarly, we can show that the probability of exactly
 1 match = ~0.00977 = ~0.977%
 2 matches = ~0.0440 = ~4.40%;
 Σ[Pr(1<=k<=10)] ≈ 0.999 ≈ 99.90%
 Meaning that the probability of 3 matches or more is
 Pr(k>=3) = (0.999 - 0.00977 - 0.0440) = ~0.945 = ~94.5%
But, if F/A = 0.05, the probability of matches in 10 new participants is:
 1 match = ~31.5%
 2 matches = ~7.46%
 3 matches = ~1.05%,
 Σ[Pr(1<=k<=10)] ≈ 0.401 ≈ 40.1%
 And, the probability of 3 matches or more is
 Pr(k>=3) = ~(0.40126 - 0.31512 - 0.07463) = ~0.0115 = ~1.15%
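Readers who want to check figures like these, or try other values of F/A, could use a sketch along the following lines. It is plain Python using only the standard library; the function names are ours, and it assumes each new participant matches independently with probability p = F/A.

```python
from math import comb

def pr_exactly(k, n, p):
    """Binomial probability of exactly k matches among n new participants."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pr_at_least(x, n, p):
    """Probability of x or more matches: the sum of Pr(k) for k = x..n."""
    return sum(pr_exactly(k, n, p) for k in range(x, n + 1))

N = 10  # new participants, as in the examples above
for p in (0.5, 0.05):
    print(f"F/A = {p}")
    for k in (1, 2, 3):
        print(f"  exactly {k}: {pr_exactly(k, N, p):.4f}")
    print(f"  1 or more: {pr_at_least(1, N, p):.4f}")
    print(f"  3 or more: {pr_at_least(3, N, p):.4f}")
```

Summing the exact probabilities directly avoids the rounding that creeps in when subtracting already-rounded figures by hand.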
Binomial
distributions behave differently for small probabilities than they do for
moderate or large probabilities (p>0.5).
This graph shows the probability of getting
exactly x matches in 10 trials at various values of p.

The two
graphs below show the probabilities of getting x matches or more in a group of
10 participants for p<=0.1.[26]


Note the “hockey stick”
shape to the cumulative probability curves as the slope gets close to zero.
For example, with p=0.01, Pr(k=1) ≈ 9.14%; Pr(1<=k<=2) ≈ 9.55%; and for all larger x, Pr(1<=k<=x) ≈ 9.56%.

The graph
below shows the much different curves for higher probabilities (p>=0.5).

Note
the S-shaped curves. Slope increases initially, then levels off.

Finally,
we might ask, “What is the probability for any matches in a group of 10
new participants?” The answer varies with the F/A ratio; it is the sum of the
probabilities of 1 to 10 matches, as shown in the graph and table below.

F/A ratio =  0.01   0.02   0.05   0.1    0.3    0.5    0.75   0.9
Prob. ≈      9.6%   18.3%  40.1%  65.1%  97.2%  99.9%  ~100%  ~100%
As the
F/A ratio → 1, Pr(k>0) → 100%.
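The probability of any match can also be obtained without summing ten terms, since "at least one match" is the complement of "no matches": Pr(k>0) = 1 - (1 - F/A)^n. The Python sketch below uses that shortcut for a group of 10; the function name is ours, and it again assumes independent trials with p = F/A.

```python
def pr_any_match(p, n=10):
    """Probability of at least one match among n new participants."""
    return 1 - (1 - p) ** n

for p in (0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 0.75, 0.9):
    print(f"F/A = {p:<5} Pr(any match) = {pr_any_match(p):.1%}")
```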
Binomial
probabilities rest on the probability of success in each trial (p=F/A), the
number of trials (sample size) and the number of successes of interest.
We have
attempted to introduce a way of looking at the probabilities of new participants
matching other participants in relation to project size; we hope it is helpful
to project administrators.
 It is worthwhile for project administrators to estimate the number of
separate and unique ancestral lines (A) for the project’s target group. This
may not be an easy task, but it is an important aspect of project
administration.
 If surveying the YDNA of the target population is a project goal,
estimating the number of ancestral lines (A) establishes the size of the
task.
 Five methods for estimating this number (A) were described, with an
example for each method:
 Population growth,
 Surname frequency,
 Historical analysis
 Ancestral Trees &
 A posteriori.
 The total number of ancestral lines (A) within the project's target
population is one
of two variables determining the probability of matching. The other variable is the
number of lines found within the project (F).
 The ratio of found lines within the project (F) to ancestral lines (A) reflects a project’s
life cycle or its performance. Young or poorly-administered
projects may have F/A ratios near zero;
mature and well-administered projects may have F/A ratios near one.
 The ratio of found lines within the project (F) to ancestral lines (A) determines the probability of a new
participant matching existing participants: Pr(match, n=1) = F/A.
 A project with a small number of ancestral lines (A) can achieve a high
match rate with small project size and a small number of found lines.
 A project with a large number of ancestral lines (A) may not achieve a
high match rate until it has found a large number of lines and thus attained
a larger project size.
 Given projects of approximately equal size, a lower match rate may
suggest a larger number of ancestral lines.
 The binomial distribution provides a means of estimating the probability
of a particular number of matches within a group of new participants.
 More research into the origins and distribution of surnames is needed.
To revisit our opening statements in light of this exploration, the number
(or proportion) of unmatched participants is less related to absolute project
size than it is to the number of found lines relative to ancestral lines: in
other words, the F/A ratio.
This is not intended as the final word on this subject. If we have sparked
interest in further examination, our goal is achieved.
Version: 10 Oct 2010
End Notes:
[0]
Example end note: Click the blue underlined number to return to your place in the document.
[1] We use
the genealogical term “lines”, rather than the genetic term “haplotypes” because
our focus is not on genetics per se, nor specific individuals but genetic
genealogy as applied to groups. Line is not exactly synonymous with haplotype,
in that a line may incorporate variations on the line founder ancestor’s
haplotype.
[3] A full
discussion of genealogical time frame is beyond the scope of this undertaking.
ISOGG defines it at
http://www.isogg.org/tree/ISOGG_Glossary.html
as “A time frame within the last 500 up to 1000 years since the adoption of
surnames and written family records. An individual's haplotype is useful within
this time frame and is compared to others to help identify branches within a
family.” A general-use definition starts it with widespread adoption of
surnames – a practice whose dates vary with cultures, from as early as
2,500 BCE for the Chinese to as late as 1811 CE (AD) for the Dutch. For most of
Western Europe, it is considered to begin sometime from 1350 to 1400 CE. One
explication is at
http://www.daltongensoc.com/diharchive/6_8_August_2003/text.html.
[4]
“Daughtered out” means that a male’s progeny includes only daughters and has no
sons to which to pass his Y chromosome. The subject of lines with no living
direct filial descendants is worthy of exploration. However, it is beyond the
scope of this undertaking.
[6] The
importance of the Black Death pandemic (bubonic plague, Yersinia pestis), a truly
horrible disease, in any discussion of surnames in Western Europe cannot be
overstated; it is the single most significant event. It killed an estimated 33%
to 40% of the European (& British) population from 1348 to 1350 and recurred in
several epidemics until the 19th century. The morbidity & mortality
resulted in much social & economic turmoil, attributed by some as the cause for
permanent social changes such as universal surnames. See also
http://en.wikipedia.org/wiki/Black_Death, which quotes Philip Daileader as
estimating English mortality closer to 20% than 40% and some other countries at
60% to 80%.
[8] We have
picked 1355 as the date for the beginning of the genealogical time frame due to
its proximity to the first and most deadly Plague attack; the reader is free to
choose other dates and work out the mathematics. There are some (unverified)
indications that a 1353 edict by England's King Edward III mandated that subjects then
without surnames take one.
[9] The
confidence level comes from the standard deviation of the growth rates for the
various periods considered. The rate averaged 0.91% with a standard deviation of
0.72%. This was the only estimate for which a basis for calculating a standard
error was found.
[11] Due to
absence of standardization in ancient spelling – with some variants becoming
“standard” – it is usually necessary to consider all spelling variants when
relying on census data to estimate A. The family historian needs to appreciate
that many (95%?) of our ancestors were illiterate; the written versions that we
see of their names were spelled by others, according to the writers’
idiosyncrasies. (The author, for example, has one ancestral line with a
one-syllable surname, "Cales", with more than a dozen spelling variants.)
[12]
Throughout, I will use the Taylor project as one example. The primary reason for
this choice is that the necessary data is available to me. The reader is invited
to supply his or her own examples.
[15]
Source: “Gazetteer of Markets and Fairs in England and Wales to 1516”,
http://www.history.ac.uk/cmh/gaz/gazweb2.html. A market was held weekly; a
fair was held annually. Most were held under grants by the Crown – either by
warrants or letters; some others were held “prescriptively” (by ancient custom),
often predating Norman times. In a random sample of 33 places named, only
two were found to have a grant later than 1350 and both were prescriptive. The
mean date of grant for the sample was 1252 with a standard deviation of 62
years.
[16]
Editorial opinion/rant: The author has read much of what has been written in
regard to the origins, distribution and variations of the Taylor surname. It is
mostly junk – unfounded speculation &/or wishful thinking. Leaving aside
unattributed quotation & paraphrasing (i.e., circular plagiarism), many
pontificate but few bother with sources or evidence.
[17] The
full Gazetteer list of place names was copied and pasted into an Excel
spreadsheet and a list of random numbers generated to point to locations on the
list. Dates of founding or first grant were then looked up for the indicated
places and recorded in the spreadsheet.
[18] The
author has found no substantiation of the number of itinerant tailors – beyond
vague mentions that they did exist – and no information on the economics of
their practices. In other words, we substitute pure guesses.
[19] Rules
for declaring a “match”, therefore a group or cluster, vary from one project to
another. “Tight” match rules produce a lower match rate and thus a higher
estimated A by the a posteriori method. “Looser” rules produce a higher match
rate and thus a lower estimate for A.
Taylor project match rules, effective March 2010, are "A high probability of
sharing a common male ancestor within a genealogical time frame (since 1350)."
This translates to the following types of matches: >=24/25, >=34/37, or >=61/67.
(No 12-marker matches qualify under current rules, though previously-declared
matches have not been rescinded.)
[20] Group,
in our sense, means multiple participants found to have matching DNA. It is not
used in the sense of a project.
[21] In
addition to the 39 groups matched on the basis of a probable common male ancestor
within a genealogical time frame, there is a group for the Western Atlantic
Modal Haplotype (WAMH) and another for those who differ from WAMH on one marker.
[22] To correct for "finite population" (i.e., A < 10*F), multiply by FPC =
sqrt[(A-F)/(A-1)].
[23]
“Penetration” is used here in the sense of ratio of project participants to the
target population. Source: James Irvine, “Towards improvements in yDNA Surname
Project Administration”, ISOGG forthcoming. His definition is "the ratio of the
number of yDNA tests for a given surname to the world population thereof.".
[25]
For those not familiar with mathematical notation: the Greek letter sigma (Σ) indicates summation.
Exclamation points represent factorial numbers; 3!=3*2*1=6; 4!=4*3*2*1=24, etc. Carets, ^, mean
“raised to the power of”. Parentheses & brackets indicate order of operations.
≈ means “approximately equal to”; ~ means “approximately.”; < means “less
than”; << means “much less than”; > means “greater than”; >> means “much greater
than”.
[26] The
formula for cumulative probabilities in the graphs is Pr(k>=x) = Σ(Pr(k=1 to 10)) - Σ(Pr(k=1 to x-1))
[27] An alternative view is thinking of
"Singletons as those waiting to be found or discovered to be NPE". The project administrator who wishes to take
this view may reduce the singletons component of found lines by an NPE factor,
thought to be 35%.
[28]
Debbie Kennett wrote on 14 Sep 2010: "From anecdotal evidence and my own observations, it would appear that
many surnames in the US have been the subject of founder effects with one emigrant ancestor often accounting
for thousands of living descendants. In contrast the lines in the British Isles are often on the verge of
extinction, sometimes with only a handful of living descendants.
If therefore you have one line in America with 2000 living males and five lines in England, each of which have
only two descendants, if you take a random sample, you are highly unlikely even
to sample anyone from the
English lines. The problem is compounded because Americans are much more likely to pay for a DNA test. In
practice, what this means is that Americans who take a DNA test are much more likely to have a match than their
counterparts in the British Isles."
[29]
To be sourced & evaluated: "c.4% of father-son transmissions have 1 or more mutations in 67 markers, c.0.4% have 20 or more
mutations."
[30]
There are, of course, exceptions, called non-parental events, abbreviated NPEs.
The paternal surname practice originated in China, ca. 2500 BCE, and became universal in most of Europe by 1400 CE.
Late adopters of the practice include the Netherlands (1811) and Scandinavia (late 19th century).
[31]
The A_{G} ≈ 249 estimate is about 1.04 times
current F, yielding a F/A ratio ≈ 0.96. For it
to prove valid, new participants should experience a match rate of ≈ 96%.
[32]
Household size estimates are problematic, particularly across countries and long
time spans. They depend greatly on culture and economic situation.
A reviewer (James Irvine) comments on 14 Sep 2010: "Household size: an old bit of work I did says: The ratio of adult males
to total population depends on the age chosen for “adult”, and varies from country to country. In USA today 40%
of the population are males over about 16. (It so happens that the average household size around the world is
2.5 (www.nationmaster.com > People Statistics > Average size of households),
i.e. 40% of the world population are “householders”, and so this arbitrary
definition of adult males is numerically similar to the number of
householders)."
[33]
[34] This
exploration is analogous to the "Drake Equation", proposed by Frank Drake in
1961 to estimate the number of civilizations in the Milky Way Galaxy with whom we
might communicate. Depending on multiplying together a number of factors
whose values were mostly unknown, Drake intended the equation to be merely a
means of organizing discussion about research to be performed. He did not expect
a final answer.