Table of Contents: List of Figures:
  1. 50 most common names
  2. Percent of matches
  3. Haplogroups
  4. X2 contributions
  5. X2 contributions detail
  6. Total matches by type
  7. Total by number markers
  8. Matches by type & surname agreement
  9. Pct. agreement by type & surname
  10. Pct. agreement by name & quality band
  11. Agreement by number markers
  12. Agreement by match type
  13. Agreement by quality band
  14. Agreement by name & quality band
  15. 25-marker matches by number
  16. Agreement by number matches & quality band
  17. Agreement by number matches & match type
List of Tables
  1. Population Description
  2. Probability of no association
  3. Data collection format
  4. Phase 2 data summary
  5. Data by quality band
  6. X2 test for HO
  7. Name agreement by haplogroup & quality band
  8. Name agreement by number matches & quality band

Y-DNA and Surname Association

by Ralph Taylor, project administrator for Taylor Family Genes
with assistance of co-administrators, Lalia Wilson and George West

Revised to: 15 October 2011


Abstract:

This paper examines the association between surnames and Y-chromosome DNA STR matches for participants in the Taylor Family Genes project. Its goal was to test the degree of truth in a commonly-expressed belief that these phenomena are strongly associated with each other.1 An initial phase, of matches by participant, established that a positive association exists, but is weak. A second phase tested the association more rigorously, by type and quality of match, and found a generally weak association that is stronger for participants with small numbers of matches.

The data show that Y-chromosome DNA is less strongly associated with specific surnames than is often believed. Surname differences may account for less than 25% of variance in Y-DNA haplotype patterns for those with more-common surnames.

The paper also presents concepts and methodologies for other investigations of the association.


Introduction:

Turi E. King’s and Mark A. Jobling’s article, “Founders, Drift, and Infidelity: The Relationship between Y Chromosome Diversity and Patrilineal Surnames”2 is a ground-breaking development on this subject. They studied the relationship of surnames to Y-haplotype similarity in Britain and found an inverse correlation between surname frequency and haplotype similarities, a strong relationship for rarer surnames and a weaker relationship for more common surnames. However, they did not find a similar correlation in reviewing the 2006 McEvoy and Bradley study of Irish surnames and Y-DNA.3 The current study differs from King’s and Jobling’s in that it focuses primarily on one, high-frequency surname.

Also, in a 1999 article for the Florida State University Law Review4, Chris W. Altenbernd reviewed legal ambiguities and pointed to increased genetic testing as indicating a need for statutory reform. He proposed new terminology to replace pejoratives – “marital child” for the biological father being the mother’s husband, “non-marital child” for unmarried mothers and “quasi-marital child” for children with a biological father other than the mother’s lawful husband at the child’s birth. He contends that “The term ‘paternity’ should only refer to biological fatherhood.”

That surname and Y-DNA do not have a generally strong association for a common surname is a statement which may provoke incredulity within the genetic genealogy community. Such statements require proof. However, data from one DNA surname project shows only weak association between surnames of participants and DNA matches found. There is, in fact, more disassociation than association at most match qualities.

Analyses of data from Taylor Family Genes (a DNA surname project of Family Tree DNA) were conducted relative to Y-DNA matches and both project membership and members’ surnames. These are described under Methods.

The data show that disassociation between surname and DNA goes beyond the allowances usually made for non-paternity events (NPE) for matches of lesser quality than 37 markers with a genetic distance (GD) of one or zero and 25 markers with a GD of 0. For all 25-marker matches and for 37-marker matches with GD >2, disassociation was stronger than could be accounted for by NPE. At 37 markers, GD<2, and 25 markers, GD=0, the data became inconclusive as to association.

The study is limited by two factors:

  1. A sample size of one DNA surname project is statistically inadequate; it does not demonstrate variation. It is, however, the only data source available to the author in sufficient detail to permit the required analyses. Other project administrators are invited to conduct independent analyses to see if these findings hold up.
  2. The data source may not be representative of the surname studied. The project participants are highly concentrated in North America and other unknown selection may be at work.

Background:

Surnames and their inheritance

A father passing down his surname to his children and his sons passing down the same surname to their children, etc. is now a staple practice in Europe, the New World and other places.5 This practice, though widely believed to be so, is neither universal nor ancient.6 It dates, in England, to the 11th century for some families and to no earlier than the mid-14th century for most commoners. Some areas of the world do not use surnames; in some, family names precede given names; and in a few, matriarchal family names are the practice.7

A characteristic of surnames is variety and diversity. Even the highest-frequency name in the English-speaking world, Smith, was carried by just over 1% of the US population in 1990; the 50 most-frequent names, together accounted for less than 14% of the population. Further, the trend appears to be toward increasing diversity. Smith declined to 0.88% in 2000 and the most common 50 declined to 12.4% cumulative.6 7


Figure 1: Frequency of 50 most common surnames

King and Jobling8 found “a remarkably strong relationship between these patrilinearly inherited cultural markers and Y-chromosomal haplotypes.” and “a clear genetic signal of coancestry can be observed.” They attributed this largely to multiple founders for common names, versus single or few founders for rare names. Taylor – 4th in surname frequency in England, 13th in USA – would fall between the names included in their study of Smith (the most common) and King (37th in England).

"Common" names vs. "Rare"

One might ask, "What is a 'common' surname, as compared to a 'rare' one?" One measure could be the median frequency; at the median, half the population bears names that are more frequent and half bears names that are less frequent. This mid-point frequency -- in the United States -- is ~70 per million (0.007%). The two US names which come closest to this median point are Varner (49.996% have more frequent names) and Spangler (50.003% have more frequent names).

The Taylor Surname

Taylor is not the most common surname in English-speaking countries; Smith is more than twice as popular. But Taylor has a high frequency -- among the top few; it is carried by more than one of every 200 English subjects and almost one of every 300 US citizens (331 per 100,000). It ranked 4th in frequency in England in the 1998 Electoral Register and 13th in the US 2000 census.10 It has an occupational origin, coming from the French tailleur for a cutter of cloth. In written records, it is seen as a sobriquet as early as the late 12th century and as an inherited family name from the late 14th century. It has several spelling variants, but most are rare; Tyler is the most common variant (415th in the census at 0.027%).

The number of founders of the Taylor surname is unknown, but is estimated (by a variety of techniques) to range from a few hundred to, perhaps, as many as 2,500.11 Some of those paternal Taylor lines would undoubtedly have been extinguished over the centuries and no longer exist today.

Y-chromosome inheritance

A biological father transmits his Y-chromosome to his son (only males have Y-chromosomes) almost without change. Short-tandem repeat alleles on Y-chromosome loci (markers) change infrequently, an average frequency of once in every 250 to 400 transmission events.12 The organizing principle of DNA surname projects is that common paternal ancestors may be revealed by means of high-quality Y-DNA matches.

In the course of administering the Taylor Family Genes project, the author observed that, while many of the participants had matches, large numbers of these were with neither Taylor-named individuals13 nor non-Taylors who had joined the project15. He formed a general impression that the numbers were too great to be accounted for by recent generations’ non-paternity events. Nor were they accounted for by a multiple founders theory.

Y-DNA STR matches:

In short-tandem repeats (STR) testing, the number of STR motif repetitions is counted for a number of loci (markers) on the Y-chromosome; the allele value for each locus is its STR count. To determine whether a match between two men exists, and its quality, the marker-allele values tested in common are compared and their absolute differences summed (or, for some markers, the fact of a difference is given a value of 1). The resulting sum is referred to as genetic distance (GD) and indicates the dissimilarity between two haplotypes to the extent measured.

Another way to think of this is as strings of STR results being words and the marker/allele values being the words’ letters; for example, “tailor” and “taylor” disagree in only one letter. If we assign the value 9 to “i” and 10 to “y”, their distance is 1.

The number of markers it is possible to compare (e.g., letters in the word) and genetic distance work together to determine match quality (similarity of the words). In general, the more markers tested in common the more confident one can be about statements of a shared paternal ancestor and the less the genetic distance the greater the likelihood a shared ancestor was more recent. A pair of men for whom one can compare 37 markers and arrive at GD=2 are more likely to share a common ancestor more recently than a pair with 25 markers, GD=2 or 37 markers, GD=4.
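As a minimal sketch of the distance calculation described above (not FTDNA's actual algorithm), the Python function below sums absolute allele differences over the markers two men have tested in common. Markers named in the hypothetical `count_as_one` set contribute 1 for any difference. The marker names and allele values are invented for illustration.

```python
def genetic_distance(hap_a, hap_b, count_as_one=frozenset()):
    """Sum of allele differences over the markers both men have tested.

    hap_a, hap_b: dicts mapping marker name -> STR repeat count.
    count_as_one: markers where any difference counts as 1
                  (a hypothetical choice, for illustration only).
    Returns (genetic distance, number of markers compared).
    """
    shared = hap_a.keys() & hap_b.keys()
    distance = 0
    for marker in shared:
        diff = abs(hap_a[marker] - hap_b[marker])
        if marker in count_as_one and diff > 0:
            diff = 1
        distance += diff
    return distance, len(shared)


# Invented haplotypes; marker names and values are illustrative only.
man_1 = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
man_2 = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 9}

gd, markers_compared = genetic_distance(man_1, man_2)
gd_capped, _ = genetic_distance(man_1, man_2, count_as_one={"DYS391"})
```

Reporting the number of markers compared alongside the distance reflects the point made above: the same GD means something different over 25 markers than over 37.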

Non-paternity events (NPE):

It is recognized that non-paternity events (an event or series of events resulting in a child not carrying the surname his or her biological father was born with) cause disassociation between a son’s Y-haplotype and his surname. Abbreviated NPE, they include adoption, name change, illegitimacy, etc.15

The rate per hundred births appears to vary, depending on culture and economics. King & Jobling 16 found it to be 1.00% to 4.54% for certain British surnames, with a median nearer the lower figure. A phenotype study in the state of Nuevo Leon, Mexico found it between 9.8% and 13.8% (0.118 ± 0.020). 17 In Switzerland, another study18 put it at 0.3% to 1.3%. A Michigan, USA study 19 had it at 1.4% of white children and 10.1% among black children.

NPE are often undocumented, presenting difficult problems in genealogical research. Genealogical effects of NPE are cumulative through generations, often estimated at 35-40% of participants for many projects, and a possible reason for growth of genetic genealogy as paper trails turn cold.
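To illustrate how a small per-generation rate can accumulate, the sketch below assumes a constant, independent NPE rate r at each transmission, so that the chance of at least one NPE along a line of n generations is 1 - (1 - r)^n. The constant-rate assumption and the example inputs are illustrative only; they are not findings of this study.

```python
def cumulative_npe(rate_per_generation, generations):
    """Probability of at least one NPE somewhere along a paternal line,
    assuming the same independent rate at every transmission."""
    return 1.0 - (1.0 - rate_per_generation) ** generations


# Example rates and generation counts are assumptions chosen for illustration.
for rate in (0.01, 0.02):
    for gens in (10, 20):
        print(f"rate={rate:.0%}, generations={gens}: "
              f"{cumulative_npe(rate, gens):.1%}")
```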


Methods:

This section describes the methods and procedures employed in the study.

Conceptual considerations:

Choice of measurements

Correlation is not possible

Correlation – a specific type of association between variables – can exist only for scalar quantities. Weight can correlate with height, but hair color (a qualitative variable) can only associate with eye color or other variables.

If there did exist one Y-DNA haplotype for the Taylor surname, another for Smith and yet another for Anderson -- we still could not say that Y-DNA "correlates" with surname. It is not statistically possible.

Measuring association is possible

Names, “cultural markers”, are categorical (nominal or qualitative) variables, rather than numeric (ordinal). The name Adams is neither more nor less than Zaun; they are merely different distinguishing labels, as apples and oranges. We can not measure a linear correlation between names and other variables; we can, though, measure the broader concept – association 20 – and test it statistically. For these measures and tests, we need non-parametric statistics.

Also, any individual European surname is much less frequent than other markers, such as eye color. For example, about 10% of European-ancestry persons have the rarest eye color, green, and less than 1% of English persons have the most common surname, Smith. This fact will have its effect when it comes time to measure.

Note: Re-work using TiP scores

In the second phase, we developed quantified variables, amenable to correlation -- a rank order of match types by quality and percent of matches in agreement with surname.

Measuring Y-chromosome DNA

Y-DNA haplotypes may be regarded either as nominal variables or as sets of marker/allele values which can, in turn, be treated in either nominal or ordinal fashion. The analysis can be taken beyond Adams ≠ Zaun; similarity and dissimilarity can be quantified. To quantify Y-DNA similarity, we used the “match” concept and considered only limited degrees of similarity as qualifying; in Phase 2, we ranked match types by quality.

June 2013 note: Subsequent information suggests that TiP (a FTDNA mutation-adjusted TMRCA calculator) scores would be a better measure of haplotype similarity than genetic distances.

Study phases

Data Source:

The data was gathered from the Y-DNA matches (and absence of matches) of participants in Taylor Family Genes – a DNA surname project sponsored by Family Tree DNA (FTDNA). It is an “open membership” project, meaning that no prior approval or proof is required to join; membership is self-selected. However, the fact that monetary payment is required to purchase a test from FTDNA may bias the membership toward those who have “brick-walled” on their documentary genealogical research.

Total project membership at the time of data collection was 437, of whom 96% are USA residents. 382 had Y-DNA tests and 264 had available results for 37 or more STR markers. These 264 participants formed the basis of the study.

The restriction of project membership to FTDNA clients enabled us to use FTDNA tools for quickly finding matches, without respect to surname, throughout a large database of comparable Y-DNA results.

King and Jobling observe that “sample ascertainment bias (in particular self-selection of men who may be closely related and self-reporting of data) remains a serious and unquantified problem that could affect interpretation.”21 In comment, self-reporting of results is absent here and self-selection bias toward closely-related men would tend to strengthen associations beyond those found. (Self-selection may tend to bias in the opposite direction.)

Population Description:

The study began by eliminating from sampling those participants without Y-DNA results or with fewer than 37 markers of results. Fewer markers than 25 were considered unreliable for assessing match quality. To summarize the population from which samples were drawn:

Category                                     Total    Pct.
Taylor surname: Yes                            229   87.1%
Taylor surname: No                              34   12.9%
≥ 37 markers (all)                             263    100%
Y-DNA matches: Any match                       245   92.8%
Y-DNA matches: Taylor or in-project            148   56.1%
Y-DNA matches: Non-Taylor, non-project         188   71.0%
Y-DNA matches: Taylor only                      57   21.7%
Y-DNA matches: Non-Taylor only                  97   36.7%
Y-DNA matches: Both                             91   34.3%
Y-DNA matches: Neither                          19    7.3%
Table 1: Population Description


Figure 2: Percent of matches

For the surname matched category, the most frequent value (mode) was “non-Taylor only”.


Figure 3: Haplogroups within project

Procedures and Definitions:

The data set was restricted to those with at least 37 markers in order to facilitate comparisons in number of matches at successively higher quality levels; of a total of 436 participants, 54 were eliminated for having no Y-STR results, 96 for having only 12 markers and another 23 for having only 25 markers. This gave a qualifying population of 263 participants at the time of the study.

“Index person”: The study involved searching for matches one participant at a time and counting the number of matches whose surnames agreed and disagreed with the participant’s name. Each participant for whom a search was conducted was designated the index person for that search.

Name Variants: Spelling variations (e.g., Taylor, Tailor, Tayler, Taler, Talor) were treated as equivalent. No instances were encountered of foreign-language words with the same meaning (e.g., Schneider). Similarly, spelling variants of other surnames were accepted as equivalent and no foreign-language versions were encountered.

Match: Matches recorded were those reported by FTDNA while conducting member-by-member searches. These were for 25 markers, GD<3; for 37 markers, GD<5; and, for 67 markers, GD<8.

Name Agreement: This is a “surname match”. Matches for each sampled participant at various quality levels were counted by whether the matching person bore the same surname as the participant (“Agree”) or a different surname (“Disagree”). Thirty-four (34) project participants bear a surname other than Taylor; comparison was to the surname the kit donor bears.


Phase 1: By Participant

An initial survey took place from 5th to 8th August 2011 with the object of quantifying relationship, if any, between the Taylor surname &/or project participation and Y-DNA matches.23

Design:

Data was gathered as to two questions:

  1. Did the participant (index person) have a match with one or more project participants (or non-participants bearing the Taylor surname) sufficient to yield a probability of 90% or better of a common paternal ancestor within the past 55 transmission events? (This is the standard adopted by the project for declaring a high-quality match and translates to a 25-marker match with genetic distance less than two or a 37-marker match with genetic distance less than three.)
    And
  2. Did the participant have one or more matches reported by FTDNA with persons who neither were project participants nor bore the surname? (This equates to a 25-marker match with genetic distance less than three or a 37-marker match with genetic distance less than five.)

The chi-squared (Χ2) statistic24, sometimes described as a “badness of fit” test, yields a confidence level; the worse the data fits a hypothesis, the higher the chi-squared value will be. It is calculated by the equation below, where O represents observed values and E expected values.

The Χ2 distribution (for more than one degree of freedom) looks something like the graph to the right.25 The small colored area on the right tail represents the remaining probability.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

The critical values for a one-tailed test are

Critical X2 values by degrees of freedom (df)

             df = 1     df = 2     df = 3     df = 4     df = 5
p < 0.10    > 2.706    > 4.605    > 6.251    > 7.779    > 9.236
p < 0.05    > 3.841    > 5.991    > 7.815    > 9.488   > 11.070
p < 0.01    > 6.635    > 9.210   > 11.345   > 13.280   > 15.090
p < 0.001  > 10.827   > 13.815   > 16.268   > 18.465   > 20.517

For p < 0.001, we will reject the null hypotheses.

When Χ2 is inconclusive: A chi-squared test is intended to disprove a hypothesis, but not to prove it. If a null hypothesis is not rejected, that doesn’t necessarily mean it is accepted.

Non-parametric tools:

Names being nominal variables, parametric statistics (such as Pearson’s correlation) are typically not applicable; non-parametric tools to measure the association include:

Statistical Measures

The following statistical tools are available for measuring associations:

Question 1: Does any association exist between surname and Y-DNA?

If there were no association, we would expect matches by members of the project with other Taylors to be no more frequent than the name’s frequency in the general population. Counting the two most common variants (Taylor at 311 per 100,000 US residents and Tyler at 27 per 100,000), this is about 0.338%.

The conceptual problem was to construct mutually exclusive categories for the name/match variable. These categories fit the requirement:

We can assess the hypothesis of no association with a chi-squared test, but it is necessary to first define the expected values:

Chi-Square Calculation
Odds of 1+ matches = 92.8%
Odds of a match with a Taylor = 0.338% 0.314%

Category                    Expected   Observed   |O-E|-0.5 (note 27)   (O-E)^2   (O-E)^2/E
Taylor-only matches             0.83         57                  55.7     3,099       3,743
Non-Taylor only               262.93        105                 157.4    24,784          94
Both Taylor & non-Taylor        1.24         91                  89.3     7,967       6,414

chi-squared = 10,251
df = 2   p < 0.00001
Table 2: Phase 1, probability of no association
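The arithmetic in Table 2 can be checked with a short sketch. It assumes, from the |O-E|-0.5 column, that a continuity correction of 0.5 is subtracted from each absolute difference before squaring. Because the expected values printed in the table are rounded, the total comes out close to, but not exactly, the 10,251 shown.

```python
def corrected_chi_squared(observed, expected):
    """Chi-squared with 0.5 subtracted from each |O - E| before squaring."""
    total = 0.0
    for o, e in zip(observed, expected):
        total += (abs(o - e) - 0.5) ** 2 / e
    return total


# Values as printed in Table 2 (the expected values are rounded there).
# Order: Taylor-only, non-Taylor only, both Taylor & non-Taylor.
expected = [0.83, 262.93, 1.24]
observed = [57, 105, 91]

chi2 = corrected_chi_squared(observed, expected)

# Critical value for df = 2 at p < 0.001, from the critical-value table.
critical_df2_p001 = 13.815
reject_no_association = chi2 > critical_df2_p001
```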

Meaning

There is some association between the Taylor surname and Y-DNA matches. There are more matches with Taylor-surnamed men and fewer with non-Taylors than would be expected if the variables were completely independent and this finding is statistically significant. The direction and strength of the association will be explored below.

Aside: Due to the low frequencies of all surnames28, other DNA surname projects are likely to see the same findings.

Question 2: How strong is the association and what is its direction?

Now that we’ve proved an association, the strength & direction of the association is to be found. A positive direction is indicated by an excess of participants with Taylor matches over the expected, random value.

We invented a tool to estimate overall strength of the association; we tried a series of probability assumptions using chi-squared, goodness-of-fit trial calculations. We multiplied the ~0.3% probability of a match with a Taylor (thus altering expected values) by factors of 1, 10, 20, 30, etc. to see -- with the observed actual values -- where Χ2 and its component parts reached their minima. In short, we successively adjusted expectations to see where they most closely corresponded to observations. Better correspondence is indicated by a smaller Χ2 and its contributing components.
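The trial-and-error search described above can be sketched for a single category. This is an illustration under stated assumptions, not the author's worksheet: it treats only the Taylor-only category, and it assumes the expected count is the qualifying population times the ~0.3% base probability times the multiplier. The grid of multipliers is likewise a choice made for the example.

```python
def contribution(observed, expected):
    """One category's contribution to chi-squared: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected


n_participants = 263       # qualifying population
base_p = 0.003             # the ~0.3% random probability quoted in the text
observed_taylor_only = 57  # Taylor-only count from Table 2

# ASSUMPTION: expected count = participants * base probability * multiplier.
multipliers = range(10, 151, 5)
scan = {
    k: contribution(observed_taylor_only, n_participants * base_p * k)
    for k in multipliers
}

# The multiplier whose expected count lies closest to the observed count.
best_multiplier = min(scan, key=scan.get)
```

Under these assumptions the minimum falls where the expected count is nearest the observed 57. How the expected counts for the other categories were built is not stated in the text, so they are left out of the sketch.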


Figure 4: X2 contributions

The graph above shows the results of the trials. Note that for multiplier factors <20 some Χ2 values are off the scale. The bottom scale represents the multiplying factors used. At this scale, we see a declining trend in all contributors as the factor increases. We can not tell where the minima occur.


Figure 5: X2 contributions detail

This version of the graph focuses in on the area where X2 minima occur and adds the total X2 series. Total Χ2 reaches minimum at a multiplier of ~80. The Taylor-only category reaches minimum when expected probability of a Taylor match is ~75 times the random p. The non-Taylor category reaches a minimum at ~80. The Both category also reaches its minimum at ~80.

The Χ2 minima suggest that the odds of a random new member of the project matching others are approximately 80 times greater than the random probability of 0.3%, as follows:

An Excel CHITEST indicated that the probability of these being the correct expected percentages is ~0.93, a relatively high probability.

Association measurements:
Substituting the above percentages for expected values, we can estimate the association.

Summary of Phase 1:

An association between surname and Y-DNA matches does exist within the Taylor Family Genes project. Matches with the Taylor surname are more frequent than would be expected by chance.

The consensus of the measures of association (V=0.0343 and λ=0.00913) is that the association between surname and Y-DNA matches is positive but weak for members of Taylor Family Genes.

The association is so weak that some experts would interpret the V and λ values as “little to none”. “No association”, however, was ruled out.


Phase 2: By Match Type and Quality

Design:

The objective of Phase 2 was to test -- by match quality -- association between the Taylor surname and Y-DNA matches within the project. We may see association become stronger as quality increases.

A simple time-to-most-recent-common-ancestor calculator29 determined a rank order for quality of the types of matches examined, from low to high: 2:25, 4:37, 1:25, 3:37, 2:37, 0:25, 1:37, 0:37. (Genetic distance is given first, before the colon; then the number of markers compared; “2:25”, for example, means a genetic distance of 2 over 25 markers.)

The “expected value” for total matches was taken to be the sum of those actually found at the respective match qualities, as we had no better basis for establishing expected match numbers than to conduct participant-by-participant searches and record the number found. Expected values for those which agree or disagree with the index persons’ surname is a portion of this total.

Our null hypothesis30 is that a moderate to strong association exists between surnames and Y-DNA matches; but we need to state that quantitatively in order to test it.

Our alternative hypothesis, HA, is that weak or no association exists between surnames and Y-DNA matches.

Having established our null hypothesis, we can apply a chi-squared test to the observed and expected frequencies. The observed values will be the number of matches actually found in the searches, categorized by whether they agree or disagree with the index person’s surname, yielding a two-by-two table with degrees of freedom = 1.

By arranging the data by match-type quality and quality band, we have data amenable to correlation; we quantified both variables. We can calculate a correlation coefficient between match quality and the percent of match surnames agreeing with the index persons’.

Quality Bands

Note: Redefine quality bands by TiP scores.

Phase 2 Methods:

Data collection took place 8 August to 18 September 2011. A temporary reference number was assigned to the participants with at least 37 markers of results, to disguise identities from public disclosure.

Data was collected on 264 participants, the entire qualifying population, though only 150 had tested 67 markers. Data collection followed this form:

ID      2:25       1:25       0:25       4:37       3:37       2:37       1:37       0:37
       Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis
8        2    1     1    5     5    1     0    0     2    4     1    1     0    0     0    0
46       0  205     0   55     0    4     0    1     0    3     0    1     0    2     0    0

ID      7:67       6:67       5:67       4:67       3:67       2:67       1:67       0:67
       Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis   Agr  Dis
8        0    0     0    0     1    1     0    2     1    0     1    0     0    0     0    0
46       _    _     _    _     _    _     _    _     _    _     _    _     _    _     _    _
Table 3: Phase 2 data collection format

“Agr” means the surname on the match agreed with the index person’s name; “Dis” means the surname on the match disagreed with the index person’s name. An underscore (“_”) indicates no match is possible because the markers were not tested.

Phase 2 data:

The table and graphs below summarize the Phase 2 data. Detailed data is available upon request.

By Match Type & Quality: Quality 1 and Quality 2
Match Type                     2:25    4:37    1:25    7:67    6:67    3:37    5:67    2:37    4:67
Agree w/Surname                  77      32     137       7       1      50       7      96      24
Disagree                      12203     887    2748     564     345     414     208     229     114
Total Matches                 12280     919    2885     571     346     464     215     325     138
Participants w/ matches         237     238     237     142     142     239     142     237     237
Participants w/ no matches       26      23      26       8       8      24       8      26      26

By Match Type & Quality: Quality 3 and Quality 4
Match Type                     0:25    3:67    1:37    2:67    1:67    0:37    0:67
Agree w/Surname                 145      34      94      19      34      45       5
Disagree                        387      56     133      45      21      62       8
Total Matches                   532      90     227      64      55     107      13
Participants w/ matches         237     142     237     142     142     237     142
Participants w/ no matches       26       8      26       8       8      26       8
Table 4: Phase 2 data summary

Note: Of the total 263 participants, 26 (~10%) had no matches at 25 markers for any surname. This is a higher fraction without matches than the 8% observed for all matches because five (5) of the 26 had matches at 37 markers.

Graphic representations of the data.

 Figure 6: Total matches by match type

 Figure 7: Total matches by # markers
There was no paucity of total matches, though the numbers declined as number of markers & match quality increased. A small percentage of participants had no matches.
Unsurprisingly, at lower match qualities the numbers are larger and disparities are greater. (The 2:25 disagree number is literally off the chart.) But, even at high qualities, more matches disagreed as to surname than agreed.

Figure 8: Matches by type and surname agreement
Figure 8 to the left -- depicting all matches found, by quality and whether they agree with the index person’s surname -- shows a striking aspect of the data: More matches disagree as to surname than agree. The red bars are taller than the blue bars.

Figure 9: Percentage surname agreement by surname & match type

Figure 10: Percentage surname agreement by surname & quality band


Phase 2 Analysis:

Overall, the association between surname and Y-DNA matches is weak. Only 4% of all DNA-match surnames agreed with that of the index person – far lower than our null hypothesis that at least half would agree. The graph to the right depicts the percentage of match surname agreement by number of markers compared; 25-marker matches have the lowest percentage at 2%; 37-marker matches are highest at 26%; 67-marker matches are next highest at 8%.

Figure 11: Surname agreement by number of markers compared

Chi-square values for an expected half of matches agreeing with the index persons’ surnames are 7147, 1079, and 508 respectively and would be greater for an expectation of more than half. With one degree of freedom, these translate to probabilities of approximately zero (0) for the null hypothesis (of moderate to strong association) and support the alternative hypothesis of weak association.

Meaning:

The overall surname/Y-DNA association is weak for 37-marker matches and almost completely absent for 25- and 67-marker matches.

By match quality band:

We next reassembled the data by match quality band32, obtaining

Quality 1 and Quality 2
Match Type         2:25    4:37    1:25    7:67    6:67    3:37    5:67    2:37    4:67
Agree                77      32     137       7       1      50       7      96      24
Disagree          12203     887    2748     564     345     414     208     229     114
Total             12280     919    2885     571     346     464     215     325     138
Pct. Agree         0.6%    3.5%    4.7%    1.2%    0.3%     11%      3%     30%     17%
Pct. Disagree     99.4%   96.5%   95.3%   98.8%   99.7%     89%     97%     70%     83%

Quality 3 and Quality 4
Match Type         0:25    3:67    1:37    2:67    1:67    0:37    0:67
Agree               145      34      94      19      34      45       5
Disagree            387      56     133      45      21      62       8
Total               532      90     227      64      55     107      13
Pct. Agree          27%     38%     41%     30%     62%     42%     38%
Pct. Disagree       73%     62%     59%     70%     38%     58%     62%
Table 5: Data by quality band

Applying a chi-squared test to our null hypothesis that one-half or more of matches will have surname agreement:

Quality 1 and Quality 2
Match Type         2:25    4:37    1:25    7:67    6:67    3:37    5:67    2:37    4:67
Pct. Agree         0.6%    3.5%    4.7%    1.2%    0.3%     11%      3%     30%     17%
X2 =               5987     398    1181     272     171     143      94      27      29
p <=             <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  <0.001  <0.001

Quality 3 and Quality 4
Match Type         0:25    3:67    1:37    2:67    1:67    0:37    0:67
Pct. Agree          27%     38%     41%     30%     62%     42%     38%
X2 =                 55     2.7     3.4     5.3     1.5     1.4     0.3
p <=             <0.001     0.1     0.1  <0.001     0.2     0.2     0.6
Table 6: X2 test for HO: Exp=Obs/2
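The X2 values in Table 6 appear consistent with a single-term calculation on the Agree cell, (O - E)^2 / E, with E set to half the total matches for that match type. That reading is an inference from the printed numbers, not a documented statement of the procedure used. The sketch below applies it to a few match types using counts copied from Table 5.

```python
# (agree, total) counts per match type, copied from Table 5.
counts = {
    "2:25": (77, 12280),
    "4:37": (32, 919),
    "0:25": (145, 532),
    "3:67": (34, 90),
    "0:37": (45, 107),
    "0:67": (5, 13),
}


def half_agreement_term(agree, total):
    """(O - E)^2 / E for the Agree cell, with E = half of all matches."""
    expected = total / 2
    return (agree - expected) ** 2 / expected


terms = {
    match_type: half_agreement_term(agree, total)
    for match_type, (agree, total) in counts.items()
}

for match_type, value in terms.items():
    print(f"{match_type}: {value:.1f}")
```

A conventional goodness-of-fit statistic would also add the Disagree cell's term, which under an expected 50/50 split equals the Agree term and so doubles each value.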

We may reject the null hypothesis of a moderate to strong association between surname and Y-DNA for all but the following types of matches

Meaning: For most types of matches, the association is proven weak to none. For a few types of matches, the association may be moderate to strong.

Correlation between Match Quality and Surname Agreement:

A pattern emerged, as shown in Figure 12:


Figure 12: Name agreement by match type

As match quality improved, the percentage increased of matching persons whose surnames agreed with that of the index person, reaching as high as 60% for 37-marker matches with genetic distance = 0.

The pattern is seen more clearly here in Figure 13, with match types grouped into four ranked bands:

Figure 13: Name agreement by match quality band

Grouping into bands produces a strong correlation. The square of the correlation coefficient indicates the strength of the relationship between match quality and surname agreement. Our correlation coefficients and their squares are

Meaning:

Surname agreement on Y-DNA matches depends highly on the quality of the match; match quality accounts for more than half the variance in surname agreement. The better the match, the stronger the association between Y-DNA and surnames. However, this does not mean that a majority of surnames will agree for the best matches; in only one type of match (0:37) did the majority of surnames agree with the index person’s.

Furthermore, association between surname and Y-DNA may not be apparent at lower quality levels. It is nonsensical to treat a 2:25 match in the same way as a 0:67.

Analysis by surname:

We analyzed the data by whether the participant’s surname was Taylor or another.

                   Quality 1   Quality 2   Quality 3   Quality 4
Taylor, n=229
  Total Matches    13,899      1132        271         74
  Pct. Agree       1.4%        12.7%       35.1%       54.8%
  X2               6563        315         34.5        0.63
  p <=             ~0          ~0          ~0          0.7
Non-Taylor, n=34
  Total Matches    879         356         223         40
  Pct. Agree       4%          10%         9%          25%
  X2               367         116         73.5        5
  p <=             ~0          ~0          ~0          0.0025
Table 7: Match agreement by surname and quality band


Figure 14: Name agreement by surname & quality band

We reject the null hypothesis as it pertains to the four quality bands and for most types of matches.
Exceptions:

However, for the match types listed as exceptions, the numbers of matches are small and the apparent association can be affected by anomalies such as ascertainment bias.

Meaning: We’ve proven that, overall, association is weak to none but have not proven it for some specific types of matches.

Analysis by haplogroup:

We repeated the analysis by major haplogroup:

                   Quality 1   Quality 2   Quality 3   Quality 4
E, n=9
  Total Matches    49          41          58          8
  Pct. Agree       20.4%       9.8%        17.2%       37.5%
  X2               8.58        13.3        12.4        0.25
  p <=             0.003       0.003       <0.001      0.62
G, n=9
  Total Matches    108         13          25          0
  Pct. Agree       1.9%        0.0%        16.0%       NA (note 33)
  X2               50.1        6.5         5.8         NA
  p <=             <0.001      <0.001      0.011       NA
I, n=46
  Total Matches    2201        258         213         52
  Pct. Agree       1.3%        14.0%       32.4%       42.3%
  X2               1043        67.0        13.2        0.615
  p <=             <0.001      <0.001      <0.001      0.43
J, n=4
  Total Matches    60          20          13          0
  Pct. Agree       0.0%        0.0%        0.0%        NA
  X2               30          10          6.5         NA
  p <=             <0.001      <0.001      0.011       NA
R1a, n=5
  Total Matches    148         11          3           6
  Pct. Agree       0.0%        0.0%        66.7%       66.7%
  X2               74          5.5         0.17        0.33
  p <=             <0.001      0.019       0.68        0.56
R1b, n=190
  Total Matches    14,089      1145        601         109
  Pct. Agree       1.5%        12.1%       34.4%       50.5%
  X2               6627        330         29.1        0.005
  p <=             <0.001      <0.001      <0.001      0.95
Table 8: Name agreement by haplogroup and quality band

We reject the null hypothesis as it pertains to the four quality bands and for most types of matches. Exceptions:


Figure 15: Name agreement by haplogroup and quality band

Meaning: We’ve proven that, overall, association between surname and Y-DNA is weak to very weak but have not proven it for some specific types of matches in some haplogroups. The caveat about anomalies affecting higher match qualities, due to smaller numbers of matches, remains.

Correlation:

We see a correlation between match quality and surname agreement for some haplogroups.
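As an illustration of how such a correlation could be checked, the sketch below computes a Pearson coefficient between quality band and the Pct. Agree values in Table 8. It is not the author's calculation and does not reproduce any coefficient reported in the paper. It assumes the four bands can be treated as equally spaced ranks (1 to 4), and with only four points per haplogroup the result is indicative at best. The names `PCT_AGREE`, `BANDS` and `pearson_r` are hypothetical.

```python
from math import sqrt

# Pct. Agree by quality band (1 to 4), copied from Table 8.
# Haplogroups G and J are omitted because they have no band-4 matches (NA).
PCT_AGREE = {
    "E": [20.4, 9.8, 17.2, 37.5],
    "I": [1.3, 14.0, 32.4, 42.3],
    "R1a": [0.0, 0.0, 66.7, 66.7],
    "R1b": [1.5, 12.1, 34.4, 50.5],
}
BANDS = [1, 2, 3, 4]  # assumption: bands treated as equally spaced ranks


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_by_haplogroup = {h: pearson_r(BANDS, pct) for h, pct in PCT_AGREE.items()}

for h, r in r_by_haplogroup.items():
    print(f"{h}: r = {r:.2f}, r2 = {r * r:.2f}")
```

Under these assumptions, haplogroups I and R1b give coefficients close to 1, while E gives a noticeably lower value, which is consistent with a correlation for some haplogroups but not all.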

Analysis by number of matches:

We also repeated the analysis relative to participants’ total number of matches at 25 markers. The average number of matches per participant was 59.68 with a standard deviation of 139.6. Not only was the variance large; the distribution was also non-normal.

We stratified the data by total number of 25-marker matches (chosen to avoid double-counting) into five categories: no matches, low (1-10), medium (11-60), high (61-300) and very high (301+).


Figure 16: 25-marker matches by number category

Figure 16 shows the average number of matches for these five categories and the numbers for one standard deviation above and below the average.

                     Quality 1   Quality 2   Quality 3   Quality 4
No matches, n=26
  Total Matches      0           0           0           0
  Pct. Agree         NA          NA          NA          NA
Low matches (1-10), n=118
  Total Matches      378         132         246         60
  Per Participant    3.20        1.12        2.08        0.51
  Pct. Agree         27.0%       63.6%       66.1%       63.3%
Medium (11-60), n=68
  Total Matches      1853        407         315         69
  Per Participant    27.2        5.99        4.63        1.01
  Pct. Agree         5.5%        17.7%       28.6%       37.7%
High matches (61-300), n=30
  Total Matches      3854        225         113         7
  Per Participant    129         7.63        4.03        0.43
  Pct. Agree         0.4%        1.7%        6.6%        46.2%
Very High (301+), n=21
  Total Matches      10,552      718         232         33
  Per Participant    502         34.2        11.0        1.57
  Pct. Agree         0.3%        2.5%        13.8%       42.4%
Table 9: Name agreement by number of matches and Quality band
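The stratification used in Table 9 can be sketched as a small classification function. This is a minimal illustration that takes its category boundaries from the table's row labels; the function names `match_category` and `per_participant` are hypothetical and are not part of the original study.

```python
def match_category(n_matches):
    """Assign a participant to a Table 9 category by 25-marker match count."""
    if n_matches < 0:
        raise ValueError("number of matches cannot be negative")
    if n_matches == 0:
        return "No matches"
    if n_matches <= 10:
        return "Low"
    if n_matches <= 60:
        return "Medium"
    if n_matches <= 300:
        return "High"
    return "Very High"


def per_participant(total_matches, n_participants):
    """Average matches per participant within one category and quality band."""
    return total_matches / n_participants


# Low-match group, quality band 1 (Table 9): 378 matches over 118 participants.
low_band1 = per_participant(378, 118)
print(match_category(7), round(low_band1, 2))
```

The "Per Participant" rows in Table 9 are averages of this kind: total matches in a band divided by the number of participants in the category.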


Figure 17: Name agreement by number of matches & quality band

Figure 17 depicts clear differences in name agreement relative to number of matches. The Low-match group shows a stronger association with surname, and the Medium-match group shows a strong correlation between quality and surname agreement.

Here, collapsing the data into four quality bands disguises relationships within the match-number groups, so we include Figure 18, which depicts agreement rates at specific match types.


Figure 18: Name agreement by number of matches and match type

We found these patterns within the data:

Association vs. Match Numbers:

This analysis provided a surprising finding: surname agreement correlates inversely with number of matches. The more matches a participant has, the less likely it is that the surnames of those matches will agree with his; the association between Y-DNA and surname gets weaker.

For the four quality bands, the correlation coefficient between number of matches and surname agreement ranges from -0.57 to -0.59, with r2 from 0.329 to 0.35. Combining all match types, r=-0.66 and r2=0.44. This means that the number of matches accounts for one-third to 44% of the total variance in surname agreement.
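The step from a correlation coefficient to "variance accounted for" is simply squaring. The short sketch below applies it to the rounded coefficients quoted above; small differences from the r2 values in the text (for example 0.325 against 0.329) presumably arise because the text squared unrounded coefficients. The function name `variance_explained` is hypothetical.

```python
def variance_explained(r):
    """Coefficient of determination (r squared) for a simple linear correlation."""
    return r * r


# Rounded coefficients quoted in the text.
for r in (-0.57, -0.59, -0.66):
    print(f"r = {r:+.2f} -> r2 = {variance_explained(r):.3f}")
```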

The 21.5% of participants who have high or very high match numbers account for 95% of all matches, and those matches have the least association with the participants’ own surnames. Their data tend to “swamp” the data from participants with moderate and low match numbers.
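The 21.5% figure can be checked against the group sizes in Table 9. The sketch below suggests that it is computed over participants who have at least one match; this is an inference from the numbers, not something the text states. The variable names are hypothetical.

```python
# Group sizes (n) copied from Table 9.
GROUP_SIZES = {
    "No matches": 26,
    "Low": 118,
    "Medium": 68,
    "High": 30,
    "Very High": 21,
}

high_groups = GROUP_SIZES["High"] + GROUP_SIZES["Very High"]
with_matches = sum(n for name, n in GROUP_SIZES.items() if name != "No matches")
all_participants = sum(GROUP_SIZES.values())

# Share of the High and Very High groups under two possible denominators.
share_of_matched = 100 * high_groups / with_matches
share_of_all = 100 * high_groups / all_participants

print(f"of participants with matches: {share_of_matched:.1f}%")
print(f"of all participants: {share_of_all:.1f}%")
```

Only the first denominator, participants with at least one match, reproduces the 21.5% quoted in the text.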

We speculate that some participants share low-diversity haplotypes34 while most have more diverse haplotypes. Those with low-diversity haplotypes are less likely to find matches whose surnames agree with theirs.

Summary of Phase 2:

We have proven false, in most instances, the null hypothesis of a moderate to strong association between surname and Y-DNA. This means that we have proven the alternative hypothesis of a weak to absent association except in these five situations:

  1. Matches of very high quality (1:67, 0:37, 0:67), whether with respect to surname, haplogroup or number of matches;
  2. Haplogroup J, for which no conclusions were drawn;
  3. Haplogroup R1a for high (0:25, 3:67, 1:37, 2:67) and very high quality matches;
  4. Participants with no matches, for which no conclusions were drawn;
  5. Participants with 1 to 10 total 25-marker matches, for which surname agreement ranges from an average of 27% at the lowest quality to more than 60% for higher qualities.

We found strong positive correlations between match quality and surname agreement. Generally, as match quality improves, surname agreement rises. Association between surname and Y-DNA -- weak overall -- strengthens with match quality and may reach “strong” at the highest qualities.

We found inverse correlations between a participant’s number of matches and surname agreement; the name/Y-DNA association is stronger for those with few matches than for those with many matches.

We did not perform a multivariate analysis, though one would possibly be informative as to relationships between variables such as surname, haplogroup and number of matches.


Conclusions:

The matter is not as simple as whether a strong association exists between Y-DNA and surname. A good answer to the question must have nuances and complexities. Overall, the association is weak to absent; but exceptions are found.

Surnames are categorical, not numeric; they cannot correlate with other variables. They can, however, be associated with variables such as Y-DNA similarity, and the association can be measured.

A positive association between surname and Y-DNA does exist, but is often grossly overstated. It is weak within the Taylor Family Genes project and highly dependent on the quality of the matches. It is also inversely dependent on the number of matches a participant has; the more matches, the weaker the association.

Like gravitational force in physics, Y-DNA/surname association appears to diminish steeply with distance (here, genetic distance). This is the other side of the coin of Y-chromosome stability; DNA takes a longer view of time than genealogy often does.

The association is also dependent on a participant’s total number of matches. There is a strong inverse correlation between number of matches and surname agreement. The fewer matches a participant finds, the stronger the name/Y-DNA association he will see; the more matches he finds, the weaker the association. This phenomenon is worth further investigation.

It is less than 1/10 of one percent probable (p<0.001) that a majority of matches will agree with the typical index person’s surname. As to an expectation that >=60% of matches will agree with the index person’s surname, it is improbable except for very high quality matches.

  1. At lower match quality (i.e., “quality band 1”) the DNA/surname association is so weak as to be almost immeasurable. Matches of these qualities are more likely not to bear the index person’s surname than to bear it.
  2. The association becomes stronger as the match quality increases (i.e., number of markers compared increases and genetic distance decreases).
  3. The association becomes weaker as a participant’s total matches increase.
  4. Participants in DNA surname projects are advised not to restrict searches for matches to their own surnames.

This study has implications for other surname projects, particularly those for surnames whose origins are an occupation, location, color or other physical characteristic. As these tend to be common surnames, they are borne by an overwhelming majority of the population.


Appreciation:

Grateful appreciation is owed to Taylor Family Genes co-administrators Lalia Wilson and George West, without whose advice this examination would have been the poorer. However, they deserve no blame for any errors made; those are the author’s alone.



End Notes

  1. “often believed” examples: (1) Sewell Y-DNA Surname Project, http://www.stonepillar.org/: “Furthermore, there is a very high correlation between the Y-DNA and the surname in Western societies.” (2) www.familytreedna.com/public/bigelow: “…the markers on a male's Y-DNA correlate with his patrilineal lineage and surname.” (3) www.worldfamilies.net/what: “y-DNA correlates with the surname, as both y-DNA and the surname are passed down from father to son in patriarchal societies.” (4) Jobling, Mark A. at http://www.le.ac.uk/ge/maj4/surnames.html: “..we expect some correlation between the two, but it is not clear how strong that correlation is likely to be..”

  2. Mol Biol Evol (2009) 26 (5): 1093-1102. doi: 10.1093/molbev/msp022. First published online: February 9, 2009 at http://mbe.oxfordjournals.org/content/26/5/1093.full.

  3. McEvoy B, Bradley DG, “Y-chromosomes and the extent of patrilineal ancestry in Irish surnames”. Hum Genet 2006;119:212-219.

  4. Altenbernd, Chris W., “QUASI-MARITAL CHILDREN: THE COMMON LAW'S FAILURE IN PRIVETTE AND DANIEL CALLS FOR STATUTORY REFORM”, Florida State University Law Review, 1999, http://www.law.fsu.edu/journals/lawreview/issues/262/alte.html

  5. Re: “staple practice” of surname inheritance: Wikipedia, http://en.wikipedia.org/wiki/Patrilineality

  6. Re: “neither universal nor ancient”: Hartman, Jed, http://www.kith.org/journals/jed/2004/10/08/2333.html.

  7. “Some areas of the world do not use surnames, in some family names precede given names, and in a few matriarchal family names are the practice.” See above.

  8. US Bureau of the Census, http://www.census.gov/genealogy/names/dist.all.last (1990) and http://www.census.gov/genealogy/www/data/2000surnames/Top1000.xls (2000).

  9. Mol Biol Evol (2009) 26 (5): 1093-1102. doi: 10.1093/molbev/msp022. First published online: February 9, 2009 at http://mbe.oxfordjournals.org/content/26/5/1093.full.

  10. “British Surnames and Surname Profiles”, http://www.britishsurnames.co.uk/surnames/TAYLOR, and US Census Bureau.

  11. Taylor, Ralph E. at ~taylorydna/resources/explorations/size-vs-unmatched.htm

  12. (1) Walsh, Bruce at http://nitro.biosci.arizona.edu/ftdna/quick.html, “a chromosome is a molecular clock that ticks randomly within a specified rate.”; (2) Kerschner, Charles at http://www.kerchner.com/dnamutationrates.htm
     
  13. Including both project participants and non-participants.

  14. Bearing the surname is not a participation requirement in this project. Many of these have joined after discovering matches indicating the possibility of a non-parental event.

  15. NPE is the term in most general use. King and Jobling called these “NPT” for non-patrilineal transmissions. Other terms include “IAP” for incorrectly assigned paternity.

  16. Mol Biol Evol (2009) 26 (5): 1093-1102. doi: 10.1093/molbev/ [1]

  17. Cerda-Flores RM, Barton SA, Marty-Gonzalez LF, Rivas F, Chakraborty R (1999). "Estimation of nonpaternity in the Mexican population of Nuevo Leon: A validation study with blood group markers". Am J Physical Anthropol 109 (3): 281–293. Thirty-two (32) legal fathers were excluded as biological fathers in a group of 396 children.

  18. Sasse G, Müller H, Chakraborty R, Ott J (1994). "Estimating the frequency of nonpaternity in Switzerland". Hum Hered 44 (6): 337–43. doi:10.1159/000154241. PMID 7860087

  19. Ashton GC (1980). "Mismatches in genetic markers in a large family study". Am J Hum Genet 32 (4): 601–13. PMID 6930820

  20. Wikipedia, http://en.wikipedia.org/wiki/Association_(statistics): “In statistics, an association is any relationship between two measured quantities that renders them statistically dependent. The term ‘association’ refers broadly to any such relationship, whereas the narrower term ‘correlation’ refers to a linear relationship between two quantities.” {This distinction can be found in many other sources, but is best stated in the cited source.}

  21. Op. cit.

  22. The first digit of the match type represents genetic distance, the second digit represents the number of markers compared. Thus, “2:37” means a genetic distance of two across 37 markers.
  23. Projects for other than surname can use the Phase 2 design -- whether matches agree or disagree with the participant’s name.
  24. Source: National Institute for Standards and Technology (NIST), http://itl.nist.gov/div898/handbook/eda/section3/eda3674.htm
  25. Source: http://www.medcalc.org/manual/chi-square-table.php
  26. An arbitrary ratio was chosen; however, due to the low expected frequencies, it does not affect the result.
  27. Yates’ correction for small expected values has only a small effect.
  28. Only Smith exceeds a frequency of one percent.
  29. Rank order was determined by number of generations to most recent common ancestor at a given probability.
  30. A hypothesis to be disproved by data. Rejection of the null hypothesis is taken to mean acceptance of the alternative hypothesis.
  31. POL242 LAB MANUAL: EXERCISE 3A, http://homes.chass.utoronto.ca/~josephf/pol242/LM-3A
  32. Grouping matches into quality bands involved some double-counting of matches. Matches at 67 markers also tend to appear at 37 and 25.
  33. NA = No matches, either agreeing or disagreeing with surname -- division by zero.
  34. Examples include those sharing the “Niall of the Nine Hostages” haplotype. These participants tend to have very high numbers of matches and few agreeing with their own surnames. Their matches tend to show a wide diversity of names. Niall of the Nine Hostages is a legendary Irish character, a High King of Ireland of the 4th and 5th centuries.

-- End --