Y-DNA PROJECTS: Project size vs. Unmatched Participants
– an exploration for project administrators


by Ralph Taylor

It's sometimes said: "For any Y-DNA project, the number of unmatched participants will go down as the size of the project increases." A more sophisticated statement, perhaps, would be "... the proportion of unmatched participants will go down as the size of the project increases." Either version seems intuitive, but neither may be universally true, and their applicability is of concern to project administrators.

How does project size relate to the chances of new participants matching or not matching existing participants?

Hardly any historical or scientific research attention has been paid to the subject; there is little definitive or authoritative guidance; we project administrators are on our own. It is, however, a question worth examining more objectively.


{Note: Throughout, there will be superscript numbers, in blue & underlined, like this[00]. They indicate end notes for sources, explanations, etc. Each is a link to its separate note. Due to re-organization of the material, the order of the notes does not necessarily reflect their place in the document.}

We'd like to think that each new participant without a match puts us a step closer to everyone matching; but does it? And if so, how big is that step?  Our concern, I think, relates to unmatched participants representing unrealized promises from Y-DNA testing -- that it will aid them in their family history research. We’d like to see all our participants find a match and identify their CMAs (common male ancestors).

 I've given some thought to the questions, on behalf of our ~60% of Taylor project participants without matches. This writing will attempt to present a way of looking at the situation which may help project administrators to arrive at answers.

Note: As of July 2015, the percentage of project members without matches in the project has declined to ~46%.

Project variety

The variety of Y-DNA projects is large; considering there are more than 7,000 of them, it isn't surprising. There are surname projects, geographical area projects and haplogroup projects (which are not addressed here). Within surname projects, too, there is a wide variety; some concern rare names of single-point origin; others focus on common names of multiple origins. Some projects (e.g., for Scots clans) include multiple surnames.

We are looking for a model which is useful to administrators of as broad a range of size and focus as possible.

To begin:

Let's start our examination with two concepts:

  1. The number of direct ancestral paternal lines[1] (AKA, "patrilines") for the surname, region, etc. -- the "population". (Think of the total number of common ancestors at some past time.) Let's designate this number by the letter "A" to represent the ancestors who gave rise to all us descendants. 
    and
  2. The degree to which the sample (project participants) represents the variety of ancestral lines within the project target population. Let’s call this represented array “Found lines” and designate it by the letter “F”.

 The Euler’s Circle diagram below illustrates the concepts in combination. Found lines are a part of – but not all of -- the patrilines for the project’s target population.


Conceptual Diagram
In this diagram, the Found Circle is small in relation to the Ancestral circle. The number of Ancestral lines greatly exceeds Found lines.
 

The ratio of Found Lines to Ancestral Lines (F/A or F:A) is important; it

  1. Establishes the probability that a randomly-selected new participant will match one or more existing participants, and
  2. Indicates the degree to which the project has surveyed the target population's Y-DNA.

The remainder of this exploration continues with a more detailed examination.


FOUND LINES (F):

We'll start with the simpler-to-determine "Found lines". Here, we depart from other approaches. What we mean is NOT the total participation in the project. Nor is it “penetration” relative to population as described by James Irvine. In contrast to ancestral lines, the number of found lines is easy to determine within the project.

“Found lines” is a reasonably simple calculation: the number of unmatched participants plus the number of DNA-matched groups or "clusters" of participants. We can call this number "F".

Each singleton represents a line, and each group[20] or cluster (all matching participants in the group taken together) represents one line.

Note: Surname projects must make adjustments for The S component in the formula below should be reduced to account for these. 

If we let S stand for the number of unmatched singletons[27] and G stand for the number of matched groups or clusters, then:

F = S + G.

Found lines (F) is a number which project administrators may influence over time. Ancestral patrilines (A), to be discussed below, is beyond project control; it is what it is, based on past events, and cannot be affected by project administrators.


Possible relationships between S & G are beyond this present exploration's scope. However, it may be found here.

The larger F is in relation to A (the F/A ratio), the more likely a new participant will match another in the project.

Sampling problems:

To face reality, our projects’ samples are not randomized and not necessarily representative.

The sampling problems are perhaps insurmountable. We may not be able to achieve true random or representative sampling.  In the interest of getting to an approximate probability model, however, we'll leave it to others to assess how this reality affects our sampling and proceed as best we can.

It should be noted that, for most projects, A > F until all ancestral lines are found. Every new participant will either match one or more existing participants or add to the number of found lines.

Example:

The Taylor project has identified ~240 separate & unique found lines.

Update, July 2015


ESTIMATING MATCH (& UNMATCH) PROBABILITY -- THE F/A RATIO:

The likelihood of a new participant matching any other participant or participants is the ratio of found lines to ancestral patrilines (F:A or F/A), which establishes the expected probability. The closer this number is to 1, the greater the probability of matching.

Pr(match) = F/A.
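
As a minimal sketch of the two formulas so far (F = S + G and Pr(match) = F/A), the Python below computes found lines and the expected match probability. The function names and the sample numbers are hypothetical, chosen only for illustration; they are not project data.

```python
def found_lines(singletons: int, groups: int) -> int:
    """F = S + G: each unmatched singleton and each matched group counts as one line."""
    return singletons + groups


def match_probability(found: int, ancestral: int) -> float:
    """Pr(match) = F/A, capped at 1 because a probability cannot exceed unity."""
    if ancestral <= 0:
        raise ValueError("A must be greater than zero")
    return min(found / ancestral, 1.0)


# Hypothetical project: 150 singletons, 90 matched groups, A assumed to be 1,000.
F = found_lines(150, 90)
print(F, match_probability(F, 1000))
```

The cap at 1 is a design choice of this sketch: an F larger than the assumed A suggests that the estimate of A is too low, not that matching is more than certain.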

More on this below, after the discussion of Ancestral Lines (A).


ANCESTRAL PATRILINES (A):

What we mean by “ancestral lines” (A) is that number of paternal ancestors which can not be reduced further by finding intersections of lines of descent (“nodes” or common ancestors) within a genealogical time frame. This term is roughly congruous with “haplotypes” [2]. Imagine that one could identify every original ancestor for the target population who has living direct filial male descendants.

 One way to picture the concept is by comparison with the “Founder Effect”, except that these founders do not necessarily leave the larger population in which they’re embedded.

Another way we might picture this is by reversing our usual perspective: Imagine pedigrees -- back to the beginning of genealogical time -- for every person in a project's subject group. The nodes, where pedigrees intersect each other are Most Recent Common Ancestors (MRCA). But, some pedigrees may not intersect others at all; they represent independent lines of descendancy. This will be especially true for occupational surnames and some regional projects.

 The number of ancestral  lines for any particular surname or region has received very little research attention. I'd argue, however, that it is vitally important to Y-DNA project administrators, in order that they understand the nature of the task before them. It sets out the size of the task.

Axioms of Y-DNA projects

All Y-DNA projects share the same axioms:

  1. That Y-chromosome haplotypes point to a unique individual ancestor, the “founder” of each Y-DNA haplotype  (ancestral patriline).
  2. That Y-DNA is transmitted from father to son almost without change. [29]
  3. That Y-DNA remains relatively stable over many successive transmission events, through many fathers to sons, grandsons, great-grandsons, etc.
  4. That surnames in many cultures are also passed down father to son, without appreciable change. [30]

From these axioms come corollaries:

  1. That, at the beginning of a genealogical time frame [3], there existed a finite number of common male ancestors founding each unique Y-DNA pattern.
  2. That this number is – at least sometimes – knowable or subject to estimation.

 The number of ancestral  lines, A, may be small or large –  ranging from a handful to thousands –  and it can be difficult to estimate. Presumably, the more common & multi-originated the surname, the more lines will exist in the population and have existed in the past. In some cases, the number may be unknowable to an acceptable precision; however, it is always finite, greater than zero and its true value is independent of a project’s DNA results or sampling.

[Diagrams: a single ancestral patriline (1 patriline); three ancestral patrilines (3 patrilines)]

In the diagrams above, the red circles indicate ancestors; the green circles represent living descendants for Y-DNA testing.

Factors affecting A, the number of ancestral patrilines:

The number of lines which may have existed at some past time is not necessarily the number with living descendants. The number may have decreased or increased since then:

  1. Non-parental events (NPEs) -- adoptions, name changes, etc. -- will both add to and subtract from this number. Whether the additions & subtractions cancel each other or are negligible in overall effect probably varies with the surname or region.
  2. Some paternal lines will have died out [4] during the Genealogical Time Frame. They will have no living descendants for testing. A generation may have died without issue or have had only daughters, thus breaking the direct paternal descendancy.
     
  3. A reviewer notes that there may be a significant difference between the number of living descendants per line in the United States and in the country of origin. For her comment, see the end note. [28]

How many ancestral patrilines?

Depending on the nature of the project, A may range from one (single origin) through a handful (plural origin) to thousands (multi-origin). Large geographical projects of recent population growth (e.g., “Southern California”) or common surnames of multi-point origin (e.g., “Smith”) will have a greater A than rare-surname projects. As we shall see later, this has significant implications for project administration.

METHODS FOR DETERMINING ANCESTRAL LINES:

Project administrators will find little useful guidance for estimating the number of ancestral patrilines in genetic or genealogy literature. The subject has, seemingly, been considered unworthy of serious attention.

We will describe five separate approaches for estimating A and give an example for each:

  1. Population Growth model;
  2. Surname Frequency model;
  3. Historical analysis;
  4. Tree model and
  5. A posteriori use of project findings.

Some of these methods may be most useful for surname projects; some for geographical projects. All require many simplifying assumptions and are subject to error. [34] Results of multiple approaches may be assessed to suggest the most reasonable estimate.

The primary purpose is to illustrate that such estimates are possible and to start discussion about how they may be done.

Population Growth Method:

The population growth approach consists of regressing growth back to an earlier (& typically lower) population. One starts from the assumption that population growth (or decline) for a particular surname or region matches that of the population as a whole. This allows one to work from an estimated number of patrilines at one point in time to an estimate for an earlier point in time by means of compound growth formulas.

Here are two graphs of the estimated British population from 410 (after the Roman era) through 1999. [5] The left side shows population in millions with percent change. (The Black Death change is off the scale.) The right side shows the logarithmic values for population.


Estimated Population & Percent Change

Estimated Population (Logarithmic)

Note that the logarithmic curve from after the Black Death[6] to modern times is very roughly straight, with a positive slope. Annual population growth in this 650-year period ranged from a low of 0.36% to a high of 2.15%, with a mean of 0.91% & standard deviation of 0.72%.

The mathematics resemble a compound interest problem. Let AG = number of households with surname at an earlier date, D1, and H = households number at a later date, D2, with growth rate G;  then,

H  ≈  AG * (1+G)^(D2 – D1) therefore
AG ≈  H / (1+G)^(D2 – D1)

Simplifying assumptions:

  1. That each household contains one and no more than one CMA, the father;
  2. That number of persons per household remains relatively constant over the period of examination;
     

Example:

Let us start with the 22,287 Taylor households reported[7] in the 1851 English census and work backwards to an estimate for 1355[8].

AG  ≈ 22287 / (1.0091)^(1851-1355) = 22287 / 1.0091^(496) = 22287 / 89.406
AG ≈ 249, with an upper 90% confidence bound of 693[9].
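
The point estimate in this example can be checked with a short Python sketch of AG ≈ H / (1+G)^(D2 – D1). The function name is hypothetical; the inputs are the figures quoted in the text (22,287 households in 1851, a mean growth rate of 0.91% per year, and a target year of 1355). The upper confidence bound is not reproduced here.

```python
def ancestral_households(later_households: float, annual_growth: float,
                         later_year: int, earlier_year: int) -> float:
    """AG = H / (1 + G)^(D2 - D1): discount a later count back to an earlier date."""
    years = later_year - earlier_year
    return later_households / (1.0 + annual_growth) ** years


# Taylor example: 22,287 households in 1851, regressed to 1355 at 0.91% per year.
ag_1355 = ancestral_households(22287, 0.0091, 1851, 1355)
print(round(ag_1355))
```

Because the growth factor is compounded over 496 years, small changes in the assumed rate G move the result a great deal, which is one reason the estimate carries such a wide upper bound.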

Comments on example:

  1. It has been ~660 years since 1355; for 249 Taylors then to produce the present ~1,560,000 would require a constant, compounded growth rate of 1.37% per year. This makes no allowance for any lines dying out during >6 centuries.
  2. AG  ≈ 249 is just slightly more than the number of “found lines” (to be discussed below) in the project. The “acid test” of the validity of this estimate will be if the project fails to add more singletons as it grows in size. [31]

Another Example:

For another example based on the population growth method see "How many Whitneys?" by Tim Doyle.

Frequency Method:

The frequency [10] of a surname in modern times may be a rough indicator relative to other surnames. The greater or lesser frequency of a surname in recent times may suggest its relative prevalence in earlier times.[11]

The 20 most common surnames in
the United States. [13]

Rank Name     Count      Per 100k
 1 SMITH 2,376,206 880.85
 2 JOHNSON 1,857,160 688.44
 3 WILLIAMS 1,534,042 568.66
 4 BROWN 1,380,145 511.62
 5 JONES 1,362,755 505.17
 6 MILLER 1,127,803 418.07
 7 DAVIS 1,072,335 397.51
 8 GARCIA 858,289 318.17
 9 RODRIGUEZ 804,240 298.13
10 WILSON 783,051 290.27
11 MARTINEZ 775,072 287.32
12 ANDERSON 762,394 282.62
13 TAYLOR 720,370 267.04
14 THOMAS 710,696 263.45
15 HERNANDEZ 706,372 261.85
16 MOORE 698,671 259.00
17 MARTIN 672,711 249.37
18 JACKSON 666,125 246.93
19 THOMPSON 644,368 238.87
20 WHITE 639,515 237.07

The 20 most frequent surnames[14] in
the United Kingdom:

Rank Surname Count Per 100k
 1  SMITH 652,563 111.00
 2  JONES 538,874 91.66
 3  WILLIAMS 380,379 64.70
 4  TAYLOR 306,296 52.10
 5  BROWN 291,872 49.65
 6  DAVIES 279,647 47.57
 7  EVANS 222,580 37.86
 8  THOMAS 202,773 34.49
 9  WILSON 201,224 34.23
 10  JOHNSON 193,260 32.87
 11  ROBERTS 187,871 31.96
 12  ROBINSON 165,193 28.10
 13  THOMPSON 162,920 27.71
 14  WRIGHT 161,391 27.45
 15  WALKER 155,734 26.49
 16  WHITE 154,147 26.22
 17  EDWARDS 153,284 26.07
 18  HUGHES 151,269 25.73
 19  GREEN 145,856 24.81
 20  HALL 145,231 24.70

If it is assumed that frequency in the present population is indicative of frequency in a distant past, the present frequency may be multiplied by the total population in the past to estimate the number of individuals at that time. One may “daisy-chain” an estimate of household size to calculate A at the time of interest. [32]

Let Fr = frequency in persons per 100,000 in modern times, T = target population at the time of interest, and
P = total population at time of interest, then

T = P *  (Fr / 10^5)

And for the daisy-chain, let HP = number of persons per household, then:

AF ≈ T  /  HP   ≈  P * (Fr / 10^5) /  HP ≈  P * Fr / (HP * 10^5)

Simplifying assumptions:

  1. That frequency of a surname in the population is the same at the time of interest as for the time for which data is available. This assumption may not hold if an intervening period of high immigration has been experienced.  
  2. That persons per household is the same at the time of interest as for the time for which data is available. This assumption may not hold if child-bearing practices have changed significantly.
  3. That each household contains the father. This assumption may not hold immediately after widespread wars.
  4. That household fathers do not share a common male ancestor. This may not hold for certain surnames.

Example:

52.1 per 100,000 British subjects bore the Taylor surname in 2001. The population of Britain in 2001 was ~58,790,000 (587.9*10^5) and in the late 14th century about 2,250,000 (22.5*10^5). Assuming HP ≈ 5.0, then

            T ≈ (2,250,000) * 52.1 / 100,000 ≈ 1,172
            AF ≈ 1,172 / 5.0 ≈ 234.
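
The same arithmetic, as a small Python sketch of AF ≈ P * Fr / (HP * 10^5). The function and parameter names are hypothetical; the inputs are the values used in this example (a past population of 2,250,000, a frequency of 52.1 per 100,000, and an assumed 5.0 persons per household).

```python
def frequency_estimate(past_population: float, per_100k: float,
                       persons_per_household: float) -> float:
    """AF = P * Fr / (HP * 10^5)."""
    # T = P * (Fr / 10^5): persons bearing the surname at the time of interest.
    target_persons = past_population * per_100k / 100_000
    # One household is taken to stand for one patriline (a simplifying assumption).
    return target_persons / persons_per_household


af = frequency_estimate(2_250_000, 52.1, 5.0)
print(round(af))
```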

Comments on example:

  1. AF ≈ 234 is less than the number of “found lines” (discussed above) in the project, raising doubts as to the validity of the estimate.
  2. For 234 Taylors in 1355 to have produced the present ~1,560,000 would require a constant, compounded growth rate of 1.38%. This makes no allowance for any lines dying out.

Historical Method:

History may be a better guide to the number of ancestral lines for, particularly, occupational surnames.  However, the method requires research and analysis to be detailed and fitted to each target population’s history. Techniques and data needed will be different for each project.

For geographical area projects, historical analysis may be the only relevant method. Data may be readily available to establish the ancestral lines of interest.

Simplifying assumptions will vary with the project’s particular target population. For English-origin surnames, they include:

  1. That historical data from the period close to surname adoption can be found to estimate A.
  2. That only one family of each surname was allowed per town.
  3. That families of the same surname do not share a common male ancestor.

Example:

As a “rough cut”, we might estimate -- for the Taylor surname -- there was at least one family with this occupation in each mid-sized or larger English town during the mid-14th century. Assuming that only one family of each surname was allowed per town, this gives us a rough estimate for A somewhere between 1,000 and 3,000. However, there are historical indications that some tailors were itinerant, moving from town to town and setting up tents for their places of business.

To refine that estimate: About 2,400 English towns had markets &/or fairs by 1516. (Source: “Gazetteer of Markets and Fairs in England and Wales to 1516”[15].) Some of these towns would have had local tailors; others would have had itinerant tailors traveling from fair to fair. We do not have an estimate for how many of these paternal lines may have died out and ceased to have direct male descendants in the present. [16]

Adjusting year 1516 to year 1355: A random sample[17] of 33 places the Gazetteer names in England & Wales found only two places had earliest recorded dates later than 1350 (i.e., 1358 & 1425) and both were “prescriptive”, suggesting long custom. The mean date of grant for the sample was 1252 with a standard deviation of 62 years.  We take this to mean there was at most a 3% growth in the number of mid-sized to large English & Welsh towns from the mid-13th century to the early 16th.

Let Tall = total number of towns large enough for markets &/or fairs
Tall  = 2,400 * 0.97 = 2,328

Adjusting for itinerants: Assume that the 25% smallest of Tall  were served by itinerant tailors and that each would serve an average of four such towns[18]. Let Tres = towns with resident tailors, Titin = towns with itinerant tailors, I = number of itinerants and AH = total number of tailors who took the Taylor surname:

Tres = Tall  * 0.75 = 2,328 * 0.75  = 1,746;
Titin= Tall  * 0.25 = 2,328 * 0.25  =  582;
I = 582 / 4 ≈ 146

AH = Tres + I = 1,746 + 146 ≈ 1,892.
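
The chain of adjustments above can be written out as a short Python sketch. The function and variable names are hypothetical, percentages are kept as integers so the arithmetic stays exact, and rounding the number of itinerants up (145.5 to 146) is an assumption made here to agree with the figure in the text.

```python
import math


def historical_estimate(towns_1516: int, growth_pct: int,
                        itinerant_pct: int, towns_per_itinerant: int) -> dict:
    """Towns with resident tailors plus the number of itinerant tailors."""
    # Remove the assumed growth in the number of towns between the two dates.
    t_all = towns_1516 * (100 - growth_pct) // 100
    # Smallest towns are assumed to be served by itinerants, the rest by residents.
    t_itin = t_all * itinerant_pct // 100
    t_res = t_all - t_itin
    # Each itinerant is assumed to serve several towns; round up to whole persons.
    itinerants = math.ceil(t_itin / towns_per_itinerant)
    return {"Tall": t_all, "Tres": t_res, "Titin": t_itin,
            "I": itinerants, "AH": t_res + itinerants}


result = historical_estimate(2400, 3, 25, 4)
print(result)
```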

Another Example:

For another example based on historical -- or, more accurately, prosopographic -- methods, see "Gifford Origins" by Paul Gifford.

With ingenuity & persistence, similar historical analyses may help develop a priori estimates for shedding light on AH for other surnames of other origins.

Tree Methods:

In some projects, it may be possible to obtain or develop for each participant a pedigree or “tree” of sufficient quality to identify all of the original common male ancestors (CMA). This may be more feasible for projects with small target populations than those with large target populations.

This leads to:

AT= Σ(CMA)

Two approaches are possible:

  1. Bottom-up -- Participants submit pedigrees, possibly with sorting and collating by project administration, and
  2. Top-down -- Pedigrees are developed by one person or team through research into all recorded instances of the surname. (See Guild of One-Name Studies.) This is an arduous process, feasible, perhaps, only for less-common surnames.

Simplifying assumptions:

  1. That all common male ancestors of existing lines have been identified in the trees;
  2. That the trees are credible and meet genealogical standards.
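
Under those two assumptions, AT is simply a count of distinct original ancestors across the collected trees. The Python below is a minimal sketch of that count; the function name, the kit identifiers and the ancestor labels are all hypothetical, and real pedigrees would first need careful merging of ancestors recorded under different spellings.

```python
def tree_estimate(earliest_ancestor: dict[str, str]) -> int:
    """AT = number of distinct original common male ancestors in the trees."""
    return len(set(earliest_ancestor.values()))


# Hypothetical pedigrees: participant kit -> earliest proven patrilineal ancestor.
pedigrees = {
    "kit-001": "ancestor-A",
    "kit-002": "ancestor-A",
    "kit-003": "ancestor-B",
    "kit-004": "ancestor-C",
}
print(tree_estimate(pedigrees))
```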

Examples:

The Taylor project is not amenable to this approach at this time; an insufficient number of ancestral trees of adequate quality have been collected. However, it has been used successfully:

“A Posteriori” Method:

The methods described above for determining the number of ancestral lines (A) are “a priori”; they are independent of the findings or procedures of the project.  But a priori is not the only way available. A separate, alternative method (“a posteriori”) uses the project’s observed match rate to estimate the number of lines for the subject population.

Let “Y” represent the number of participants with Y-DNA STR results and “M” the number of participants with at least one match. The “match rate”, R, equals M/Y and

AP ≈ Y/R = Y/(M/Y) = (Y^2)/M
i.e., project size divided by match rate or, equivalently, the square of project size divided by the number of matched participants.

Simplifying assumptions:

  1. That project size is sufficiently large to capture an adequate number of ancestral lines;
  2. That a project’s Y-DNA results representatively sample the existing ancestral lines without undue bias to particular sets of lines.

Example:

The Taylor Family Genes project had 318 participants with Y-DNA results (as of 30 May 2010), of whom 122 had at least one match. This yields an estimate of

AP= (318^2)/122 = (101,124)/122 = ~839.
Update July 2015: AP= (651 Y-STR- 75 NTP)^2/352 = (576)^2/352 = ~942
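
A minimal Python sketch of AP ≈ Y/R = (Y^2)/M, applied to the July 2015 update figures quoted above (651 Y-STR results less 75, leaving Y = 576, with M = 352). The function name is hypothetical.

```python
def a_posteriori(tested: int, matched: int) -> float:
    """AP = Y / R, where R = M / Y is the project's observed match rate."""
    if tested <= 0 or matched <= 0:
        raise ValueError("Y and M must both be greater than zero")
    match_rate = matched / tested
    return tested / match_rate


# July 2015 update: Y = 651 - 75 = 576 participants, M = 352 with a match.
ap_2015 = a_posteriori(651 - 75, 352)
print(f"{ap_2015:.1f}")
```

Note that when every participant has a match (M = Y), the formula returns Y itself, so the estimate can never fall below project size.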

Comment on example:  A = ~839 is a much lower estimate than the A = ~ 1,892 derived by the historical method. The F/A ratio would become ~0.292, instead of ~0.121. It is, however, much higher than the estimates of AG = ~249 and AF = ~234. 

Summary:

This method is more heavily influenced by project departures from random sampling procedures than the others described here. It is also influenced by the match rules[19] of the project. It is also circular, in that a project’s past findings become a determinant of expected findings. (In other words, “It is so because it is so.”)

The a posteriori method may, however, be useful in establishing a “ballpark” for an ancestral lines estimate and it relies on no data other than the project's results.

Summary of Methods for Estimating A:

We have described five methods for estimating the number of ancestral lines.

  1. Population Growth: AG ≈  H / (1+G)^(D2 – D1)
  2. Population Frequency: AF ≈ P * Fr / (HP * 10^5)
  3. Historical: AH depends on historical circumstances at the name’s origin.
  4. Tree: AT= Σ(CMA)
  5. A Posteriori: AP ≈ (Y^2)/M

It is, perhaps, not an exhaustive list. Others may envision additional & more precise approaches and methods.

Examples:

For our Taylor examples, we derived the following estimates:

  1. Growth:           AG  ≈   249, with upper 90% CL ~693;
  2. Frequency:      AF  ≈   234;  
  3. Historical:        AH ≈ 1,892;
  4. A Posteriori:    AP  ≈   839.

Comments on examples:

These example estimates display a large spread, which, it has been suggested, jeopardizes the practicality of the logic. Alternative views are that

Estimation errors & confidence limits:

Where would a statistical examination be without some discussion of errors and confidence? All measurements or estimates are subject to error and the size of the errors, relative to the estimate, determines how much confidence to place in it.

In compound estimates, such as these, errors may easily compound upon one another rather than cancel out, thus greatly enlarging the range for confidence levels. Further, for some methods, the built-in error may be indeterminable -- at least partly due to the simplifying assumptions. It may not be possible to construct a confidence bounds estimate based on the underlying error estimates.

The Taylor estimate examples above have a very wide range; the highest estimate is almost an order of magnitude (10 times) greater than the lowest. The mean of the estimates is 804 with a sample standard deviation of 778. Based on this information alone, the 90% confidence bounds would be
                   1 < A < 2,595.
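
The summary figures for the four Taylor estimates can be reproduced with Python's standard `statistics` module. This is only a sketch: the dictionary keys are hypothetical labels, and the spread of about 778 is what the sample standard deviation (n - 1 in the denominator) returns for these four values. The confidence bounds themselves are not recomputed here.

```python
import statistics

# The four Taylor estimates of A derived in the examples above.
estimates = {"growth": 249, "frequency": 234, "historical": 1892, "a_posteriori": 839}
values = list(estimates.values())

mean_a = statistics.mean(values)      # arithmetic mean of the estimates
spread = statistics.stdev(values)     # sample standard deviation
ratio = max(values) / min(values)     # highest estimate relative to lowest

print(f"mean = {mean_a:.1f}, sample sd = {spread:.0f}, max/min = {ratio:.1f}")
```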


FOUND LINES (F):

To review "found lines": the larger F is in relation to A (the F/A ratio), the more likely a new participant will match another in the project.

 If we let S stand for the number of unmatched singletons and G stand for the number of matched groups or clusters, then:

F = S + G.

ESTIMATING MATCH (& NON-MATCH) PROBABILITY -- THE MATH:

The likelihood of a new participant matching any other participant or participants is the ratio of found lines to ancestral lines (F:A or F/A), which establishes the expected probability. The closer this number is to unity (1), the greater the probability of matching:

Pr(match) = F/A,    Pr(non-match) = 1 - F/A

Mathematically, this is a "sampling without replacement" problem; a participant once tested does not re-enter the untested pool. Each new participant, if selected randomly, has a statistical chance of matching another by the fraction F/A and (by not matching) of adding to the number of lines sampled by 1 - F/A.

Assuming A >10*F ("infinite population"), [22]

As F --> A, F/A -->1 and the theoretical probability of a new participant matching an existing participant --> 100%. However, even as F/A approaches 1, the likelihood of matching will remain below 100%. (One reason for this is that some participants will represent the last of their lines and no one else living will be available for testing.)

Let's assume some numbers and see where they take us:

These hypothetical examples demonstrate the profound effect of the F/A ratio on probabilities of matching:
A project with large A requires a proportionately large F to attain a similar matching probability to one with a smaller A.

Taylor example:

The 240 found lines yield F/A ratios as follows:

Comment on example: From the “inside”, it seems difficult to believe that this project, with such small penetration[23], has found as many as 90% of the ancestral lines that exist in the target population. Participants represent about 44 per 100,000 of those enumerated with the surname in the 2000 US federal census and about 30 per 100,000 of the most recent US & UK censuses combined. Nor does it seem reasonable that more than one million Taylor-surnamed people have only 240 common paternal ancestors after less than 700 years.
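
For reference, the F/A ratio implied by each of the four example estimates of A can be computed directly. The sketch below assumes F = 240 (the approximate found-lines figure quoted for the Taylor project) and uses the point estimates AG, AF, AH and AP from the examples; the constant and function names are hypothetical.

```python
FOUND_LINES = 240  # approximate Taylor found lines, as quoted in the text
A_ESTIMATES = {"AG": 249, "AF": 234, "AH": 1892, "AP": 839}


def fa_ratio(found: int, ancestral: int) -> float:
    """Raw F/A ratio; values above 1 mean F already exceeds the estimate of A."""
    return found / ancestral


ratios = {name: fa_ratio(FOUND_LINES, a) for name, a in A_ESTIMATES.items()}
for name, r in ratios.items():
    # Pr(match) is the ratio capped at 1, since a probability cannot exceed unity.
    print(f"{name}: F/A = {r:.3f}, Pr(match) = {min(r, 1.0):.1%}")
```

A ratio above 1, as with the frequency-based estimate, cannot be a probability; it indicates that the project has already found more lines than that estimate of A allows.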

F/A ratio as an indicator:

We might expect the ratio of F to A to vary from project to project and within a project during its life. Suggested interpretations for various values include:


BINOMIAL PROBABILITY:

The binomial distribution is useful for estimating the probabilities of various numbers of matches in a group of new members.

The problem of how many matches to expect in a sample of new participants fits a "Bernoulli Trial" model; a match is a "success" and each new participant is a "trial". We can use the binomial probability function to tell us the probability, "Pr(k=x)", that we will get exactly "k" successes (matches) in "n" trials (new participants). [24].

Pr(k) = n!/[k!*(n-k)!] * (F/A)^k * (1-F/A)^(n-k) [25]

For the number of successes, k, in n trials:

So, if F/A = 0.5, the probability of exactly 3 matches for 10 new participants is

But, if F/A = 0.05, the probability of matches in 10 new participants is:
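
Both cases can be evaluated with a short Python sketch of the binomial formula given above, using `math.comb` from the standard library. The function name is hypothetical, and the second call assumes the same question (exactly 3 matches in 10 new participants), which the text does not state explicitly.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Pr(k) = n!/[k!(n-k)!] * p^k * (1-p)^(n-k), with p = F/A."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


# Exactly 3 matches among 10 new participants at two different F/A ratios.
print(f"F/A = 0.50: {binomial_pmf(3, 10, 0.5):.4f}")
print(f"F/A = 0.05: {binomial_pmf(3, 10, 0.05):.4f}")
```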

Binomial distributions behave differently for small probabilities than they do for moderate or  large probabilities (p>0.5).


This graph shows the probability of getting
exactly x matches in 10 trials at various values of p.

The two graphs below show the probabilities of getting x matches or more in a group of 10 participants for p<=0.1.[26]

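Note the “hockey stick” shape of the cumulative probability curves as the slope gets close to zero. For example, with p = 0.01, the probability of exactly one match is Pr(k=1) ≈ 9.14%; the cumulative probability of one or two matches is ≈ 9.55%; and adding all larger numbers of matches raises the total only to ≈ 9.56%.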

The graph below shows the much different curves for higher probabilities (p>=0.5).

Note the S-shaped curves. Slope increases initially, then levels off.

Finally, we might ask, “What is the probability for any matches in a group of 10 new participants?” The answer varies with the F/A ratio; it is the sum of the probabilities of 1 to 10 matches, as shown in the graph and table below.

F/A ratio    0.01    0.02    0.05    0.1     0.3     0.5     0.75    0.9
Prob. ≈      9.6%    18.3%   40.1%   65.1%   97.2%   99.9%   ~100%   ~100%
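
The table values follow from the complement rule: the sum of the probabilities of 1 to 10 matches equals one minus the probability of no matches at all. The Python sketch below (function name hypothetical) uses that shortcut, Pr(k>0) = 1 - (1 - F/A)^n.

```python
def prob_any_match(p: float, n: int = 10) -> float:
    """Probability of at least one match among n new participants, with p = F/A."""
    return 1.0 - (1.0 - p) ** n


for p in (0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 0.75, 0.9):
    print(f"F/A = {p:<5} Pr(k>0) = {prob_any_match(p):.1%}")
```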

As the F/A ratio --> 1, Pr(k>0) --> 100%.

Binomial probabilities rest on the probability of success in each trial (p=F/A), the number of trials (sample size) and the number of successes of interest.


CONCLUSION:

We have attempted to introduce a way of looking at the probabilities of new participants matching other participants in relation to project size; we hope it is helpful to project administrators.

  1. It is worthwhile for project administrators to estimate the number of separate and unique ancestral lines (A) for the project’s target group. This may not be an easy task, but it is an important aspect of project administration.
     
    1. If surveying the Y-DNA of the target population is a project goal, estimating the number of ancestral lines (A) establishes the size of the task.
    2. Five methods for estimating this number (A) were described, with an example for each method:
      1. Population growth,
      2. Surname frequency,
      3. Historical analysis
      4. Ancestral Trees &
      5. A posteriori.
  2. The total number of ancestral lines (A) within the project's target population is one of two variables determining the probability of matching. The other variable is the number of lines found within the project (F).
  3. The ratio of found lines within the project (F) to ancestral lines (A) reflects  a project’s life cycle or its performance. Young or poorly-administered  projects may have F/A ratios near zero; mature and well-administered projects may have F/A ratios near one.
  4. The ratio of found lines within the project (F) to ancestral lines (A) determines the probability of a new participant matching existing participants:  Pr(match, n=1) = F/A.
    1. A project with a small number of ancestral lines (A) can achieve a high match rate with small project size and a small number of found lines.
    2. A project with a large number of ancestral lines (A) may not achieve a high match rate until it has found a large number of lines and thus attained a larger project size.
    3. Given projects of approximately equal size, a lower match rate may  suggest a larger number of ancestral lines.
  5. The binomial distribution provides a means of estimating the probability of a particular number of matches within a group of new participants.
  6. More research into the origins and distribution of surnames is needed.

To revisit our opening statements in light of this exploration, the number (or proportion) of unmatched participants is less related to absolute project size than it is to the number of found lines, relative to ancestral lines -- in other words, the F/A ratio.

This is not intended as the final word on this subject. If we have sparked interest in further examination, our goal is achieved.

Version: 10 Oct 2010


End Notes:


[0] Example end note: Click the blue underlined number to return to your place in the document.

[1] We use the genealogical term “lines”, rather than the genetic term “haplotypes” because our focus is not on genetics per se, nor specific individuals but genetic genealogy as applied to groups. Line is not exactly synonymous with haplotype, in that a line may incorporate variations on the line founder ancestor’s haplotype.

[2] See http://www.isogg.org/tree/ISOGG_Glossary.html, which defines haplotype thusly: “Broadly, the complete set of results obtained from multiple markers located on a single chromosome. For the Y chromosome, the term is restricted by convention to allele values (number of repeats) obtained from microsatellite (STR) markers, as described by the Y Chromosome Consortium (YCC).”

[3] A full discussion of genealogical time frame is beyond the scope of this undertaking. ISOGG defines it at http://www.isogg.org/tree/ISOGG_Glossary.html as “A time frame within the last 500 up to 1000 years since the adoption of surnames and written family records. An individual's haplotype is useful within this time frame and is compared to others to help identify branches within a family.” A general-use definition starts it with the widespread adoption of surnames, a practice whose dates vary by culture, from as early as 2,500 BCE for the Chinese to as late as 1811 CE (AD) for the Dutch. For most of Western Europe, it is considered to begin sometime from 1350 to 1400 CE. One explication is at http://www.daltongensoc.com/diharchive/6_8_August_2003/text.html

[4] “Daughtered out” means that a male has only daughters and thus no sons to whom he can pass his Y chromosome. The subject of lines with no living direct filial descendants is worthy of exploration. However, it is beyond the scope of this undertaking.

[5] Source: http://www.bbc.co.uk/history/interactive/animations/population/index_embed.shtml.

It's been commented that throughout this article, there is some confusion of the terms Britain, Great Britain, British Isles and United Kingdom; each has a different meaning. These differences have not (yet) been completely sorted out.

[6] The importance of the Black Death pandemic (bubonic plague, Yersinia pestis), a truly horrible disease, to any discussion of surnames in Western Europe cannot be overstated; it is the single most significant event. It killed an estimated 33% to 40% of the European (& British) population from 1348 to 1350 and recurred in several epidemics until the 19th century. The morbidity & mortality resulted in much social & economic turmoil, which some regard as the cause of permanent social changes such as universal surnames. See also http://en.wikipedia.org/wiki/Black_Death, which quotes Philip Daileader as estimating English mortality closer to 20% than 40% and mortality in some other countries at 60% to 80%.

[7] Source: http://www.ancestry.com/search/locality/dbpage.aspx?tp=3257&p=3251.

[8] We have picked 1355 as the date for the beginning of the genealogical time frame due to its proximity to the first and most deadly Plague attack; the reader is free to choose other dates and work out the mathematics. There are some (unverified) indications that a 1353 edict by England's King Edward III mandated that subjects then without surnames take one.

[9] The confidence level comes from the standard deviation of the growth rates for the various periods considered. The rate averaged 0.91% with a standard deviation of 0.72%. This was the only estimate for which a basis for calculating a standard error was found.

[10] See the U.S. Census Bureau’s summary, http://www.census.gov/genealogy/names/dist.all.last and Surnames of England and Wales, http://www.taliesin-arlein.net/names/search.php.

[11] Due to absence of standardization in ancient spelling – with some variants becoming “standard” – it is usually necessary to consider all spelling variants when relying on census data to estimate A.  The family historian needs to appreciate that many (95%?) of our ancestors were illiterate; the written versions that we see of their names were spelled by others, according to the writers’ idiosyncrasies. (The author, for example, has one ancestral line with a one-syllable surname -- "Cales" -- with more than a dozen spelling variants.) 

[12] Throughout, I will use the Taylor project as one example. The primary reason for this choice is that the necessary data is available to me. The reader is invited to supply his or her own examples.

[13] Source: http://www.census.gov/genealogy/names/dist.all.last. This list is based on only the most common spelling variant. For example, Rodriguez & Rodrigues  are considered separate names.

[14] Sources: http://www.taliesin-arlein.net/names/search_2.php & http://www.statistics.gov.uk/cci/nugget.asp?id=273. 2001 total population was reported as 58,789,194.

[15] Source: “Gazetteer of Markets and Fairs in England and Wales to 1516”, http://www.history.ac.uk/cmh/gaz/gazweb2.html. A market was held weekly; a fair was held annually. Most were held under grants by the Crown – either by warrants or letters; some others were held “prescriptively” (by ancient custom), often pre-dating Norman times.  In a random sample of 33 places named, only two were found to have a grant later than 1350 and both were prescriptive. The mean date of grant for the sample was 1252 with a standard deviation of 62 years.

[16] Editorial opinion/rant: The author has read much of what has been written in regard to the origins, distribution and variations of the Taylor surname. It is mostly junk – unfounded speculation &/or wishful thinking. Leaving aside unattributed quotation & paraphrasing (i.e., circular plagiarism), many pontificate but few bother with sources or evidence. 

[17] The full Gazetteer list of place names was copied and pasted into an Excel spreadsheet and a list of random numbers generated to point to locations on the list. Dates of founding or first grant were then looked up for the indicated places and recorded in the spreadsheet.

[18] The author has found no substantiation of the number of itinerant tailors – beyond vague mentions that they did exist – and no information on the economics of their practices. In other words, we substitute pure guesses.

[19] Rules for declaring a “match”, therefore a group or cluster, vary from one project to another. “Tight” match rules produce a lower match rate and thus a higher estimated A by the a posteriori method. “Looser” rules produce a higher match rate and thus a lower estimate for A.
Taylor project match rules, effective March 2010, are "A high probability of sharing a common male ancestor within a genealogical time frame (since 1350)." This translates to the following types of matches: >=24/25, >=34/37, or >=61/67. (No 12-marker matches qualify under current rules, though previously-declared matches have not been rescinded.)
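
The three thresholds in note [19] can be written as a small lookup. This is a minimal sketch, not the project's actual procedure: the function name is hypothetical, and it assumes comparisons are made only at the 25-, 37- and 67-marker levels named above, treating any other level (including 12 markers) as non-qualifying.

```python
# Minimum number of matching markers at each comparison level,
# taken from the stated rules: >=24/25, >=34/37, >=61/67.
MATCH_THRESHOLDS = {25: 24, 37: 34, 67: 61}


def qualifies_as_match(markers_matching, markers_compared):
    """Return True if the comparison meets the stated match rules."""
    if not 0 <= markers_matching <= markers_compared:
        raise ValueError("matching markers must be between 0 and markers compared")
    minimum = MATCH_THRESHOLDS.get(markers_compared)
    if minimum is None:
        # Assumption: levels not listed in the rules (e.g. 12 markers) never qualify.
        return False
    return markers_matching >= minimum


print(qualifies_as_match(34, 37))  # meets the 34/37 rule
print(qualifies_as_match(12, 12))  # 12-marker comparisons do not qualify
```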

[20] Group, in our sense, means multiple participants found to have matching DNA. It is not used in the sense of a project.

[21] In addition to the 39 groups matched on the basis of a probable common male ancestor within a genealogical time frame, there is a group for the Western Atlantic Modal Haplotype (WAMH) and another for those who differ from WAMH on one marker.

[22] To correct for "finite population" (i.e., A < 10*F), multiply by FPC = sqrt[(A-F)/(A-1)].
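
As a worked illustration of note [22], the sketch below reads the formula as the standard finite population correction, sqrt[(A-F)/(A-1)], with A as the number of ancestral lines and F as the number of found lines. The function name and the example figures are hypothetical. The function returns only the factor; what it multiplies is whatever quantity this note is attached to in the main text.

```python
from math import sqrt


def finite_population_correction(ancestral_lines, found_lines):
    """FPC = sqrt((A - F) / (A - 1)), the standard finite population correction."""
    a, f = ancestral_lines, found_lines
    if a <= 1 or not 0 <= f <= a:
        raise ValueError("need A > 1 and 0 <= F <= A")
    return sqrt((a - f) / (a - 1))


# Hypothetical project: A = 100 ancestral lines, F = 30 found lines.
# Here A < 10*F, so the note's condition for applying the correction holds.
print(round(finite_population_correction(100, 30), 4))
```

For these made-up figures the factor is about 0.84. It approaches 1 when F is a small fraction of A and falls to 0 when every line has been found.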

[23] “Penetration” is used here in the sense of ratio of project participants to the target population. Source: James Irvine, “Towards improvements in y-DNA Surname Project Administration”, ISOGG forthcoming. His definition is "the ratio of the number of yDNA tests for a given surname to the world population thereof."

[24] See Wikipedia “Binomial Distribution”, http://en.wikipedia.org/wiki/Binomial_distribution.

[25] For those not familiar with mathematical notation: the Greek letter sigma (Σ) indicates summation. Exclamation points represent factorials; 3!=3*2*1=6; 4!=4*3*2*1=24, etc. Carets, ^, mean “raised to the power of”. Parentheses & brackets indicate order of operations.
≈ means “approximately equal to”; ~ means “approximately.”; <  means “less than”; << means “much less than”; > means “greater than”; >> means “much greater than”.

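[26] The formula for cumulative probabilities in the graphs is Pr(k>=x) = Σ[Pr(k), k=0 to 10] - Σ[Pr(k), k=0 to x-1]. When 10 new participants are considered, the first sum covers every possible number of matches and so equals 1; the formula is then the same as 1 - Σ[Pr(k), k=0 to x-1].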
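
A minimal sketch of the cumulative calculation in note [26], assuming the individual probabilities Pr(k) are binomial as in note [24]. The function names are hypothetical; n = 10 mirrors the upper limit in the formula, and p = 0.3 is a made-up match probability used only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k matches among n new participants."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(x, n, p):
    """Pr(k >= x) = 1 - sum of Pr(k) for k = 0 .. x-1."""
    return 1.0 - sum(binomial_pmf(k, n, p) for k in range(x))


# Hypothetical case: 10 new participants, each with match probability 0.3.
for x in range(11):
    print(x, round(prob_at_least(x, 10, 0.3), 4))
```

With these figures, the chance that at least one of the 10 finds a match is 1 - 0.7^10, or about 0.97.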

[27] An alternative view is thinking of "Singletons as those waiting to be found or discovered to be NPE". The project administrator who wishes to take this view may reduce the singletons component of found lines by an NPE factor, thought to be 3-5%. 

[28] Debbie Kennett wrote on 14 Sep 2010: "From anecdotal evidence and my own observations, it would appear that many surnames in the US have been the subject of founder effects with one emigrant ancestor often accounting for thousands of living descendants. In contrast the lines in the British Isles are often on the verge of extinction, sometimes with only a handful of living descendants.
If therefore you have one line in America with 2000 living males and five lines in England, each of which have only two descendants, if you take a random sample, you are highly unlikely even {t}o sample anyone from the English lines. The problem is compounded because Americans are much more likely to pay for a DNA test. In practice, what this means is that Americans who take a DNA test are much more likely to have a match than their counterparts in the British Isles."

[29] To be sourced & evaluated -- "c.4% of father-son transmissions have 1 or more mutations in 67 markers, c.0.4% have 20 or more mutations."

[30] There are, of course, exceptions -- called non-parental events, abbreviated NPEs. The paternal surname practice originated in China, ca. 2500 BCE, and became universal in most of Europe by 1400 CE. Late adopters of the practice include the Netherlands (1811) and Scandinavia (late 19th century).

[31] The AG 249 estimate is about 1.04 times current F, yielding an F/A ratio of ≈ 0.96. For it to prove valid, new participants should experience a match rate of ≈ 96%.

[32] Household size estimates are problematic, particularly across countries and long time spans. They depend greatly on culture and economic situation.
A reviewer (James Irvine) comments on 14 Sep 2010: "Household size: an old bit of work I did says: The ratio of adult males to total population depends on the age chosen for “adult”, and varies from country to country. In USA today 40% of the population are males over about 16. (It so happens that the average household size around the world is 2.5 (www.nationmaster.com > People Statistics > Average size of households), i.e. 40% of the world population are “householders”, and so this arbitrary definition of adult males is numerically similar to the number of householders)."

[33]

[34] This exploration is analogous to the "Drake Equation", proposed by Frank Drake in 1961 to estimate the number of civilizations in the Milky Way Galaxy with which we might communicate. Because the equation depends on multiplying together a number of factors whose values were mostly unknown, Drake intended it merely as a means of organizing discussion about research to be performed. He did not expect a final answer.