
Grouping

Grouping is perhaps the most important task of the TFG admin team and can be the most trying. This page is written particularly to assist successors to the current TFG admin & co-admins but may be of help to other surname project admins. The emphasis here is not on specific technical details, but on an approach to dealing with those details.

What is grouping?

Grouping, known to FTDNA as "subgrouping", is the process of determining that two or more project members belong to the same (Taylor) genetic family or patriline, assigning them to it, and carrying out all the associated tasks.

It entails:

  1. Determining that a match meeting project standards exists between two or more project members. This is most often done upon new results being posted and evaluated by the admin.
  2. Adding the new match to the genetic family by means of the "Member sub-grouping" feature of the GAP.
  3. Adding the new match to the genetic family's page on the project website. 
  4. Informing group members, new and prior, of the actions taken. (See match letter.)
     
  5. Re-analyzing group characteristics in light of the new data.

If a project member does not significantly match at least one other Taylor, he is placed in a sub-group for his major haplogroup. There are 20 of these "waiting for match" sub-groups:

 

Haplotype similarity

Similarity between haplotypes is the primary key to grouping decisions. The general principle is that similar haplotypes are more likely than dissimilar ones to share a common direct paternal ancestor, and to share him more recently.

There are exceptions and qualifications to this principle. They will be discussed below.

Snowball effect

An admin is well-advised to keep current with the grouping task. If left unattended for too long, it can grow to become an overwhelming mess.

In the event a large grouping backlog should occur, a way to handle it is:

  1. Stop the pile from getting bigger. Handle current issues as they occur.
  2. Make a plan to eliminate the backlog. Estimate the problem's size and the resources needed. Include, if relevant, priorities for parts of the backlog.
  3. Execute the plan faithfully. Devote some time every day until the pile has disappeared.

Terminology

To understand the rest of the discussion, let's define some terms. (Also see the Glossary.)

Resolution

Resolution means how many markers have been tested and can be compared. Resolution may be compared to a physical description of an individual; the more data as to the person's appearance, the easier it is to identify that person.

There are 495 Y-STR markers, of which it's currently feasible to test and compare 111 with FTDNA. (Some other companies report more markers but this data isn't searchable in the FTDNA database.)

12 Markers

About 90-95% of 12-marker matches (including exact matches) do not qualify as genealogically significant. This resolution level is roughly comparable to "All cats look alike in the dark"; it does not provide enough genetic information to distinguish one patriline from another. In our experience, 12-marker matches tend to fall apart when/if higher resolution comparisons are available.

The 5-10% exceptions are instances in which rare marker values appear in the first 12 tested. Such values may be considered analogous to a unique birthmark or tattoo. But, unlike birthmarks, Y-STR values aren't known until tested; we therefore don't recommend the 12-marker test.


The 12-marker panel was developed from genetic anthropology & population genetics applications for distinguishing populations. It's highly suitable for that task but less so for distinguishing genetic families.

About 15% of Taylor Family Genes members have tested no higher than 12 markers. Each has been contacted multiple times, recommending an upgrade; it is now unlikely that more than a few of these "legacies" will upgrade.

25 Markers

About 50% of 25-marker matches are not genealogically significant; they suffer many of the same problems as their little brothers. It is recommended to advise project members that 25-marker matches have low confidence and to evaluate at higher levels when possible.


The 25-marker test was introduced as an improvement, which it was. However, it proved only marginally better than 12 markers and was shortly de-emphasized.

About 5% of Taylor Family Genes members have tested to 25 markers but no higher.

37 Markers

This is the lowest resolution at which we have reasonable confidence in genealogical significance; that is, that a close match indicates a common ancestor within genealogic time.


The added markers (#s 26-37) are the most volatile tested. CDY, with its two copies, is especially volatile (~1:28 generations) and is sometimes discounted as a mismatching marker.

About 85% of Taylor Family Genes members have tested to 37 markers. (This seems to be a much higher percentage than for the entire FTDNA database.)

There is some debate among experts as to the confidence one should place in 37 ySTR markers; an "error rate" of up to 80% has been claimed. However, we think this hasn't been adequately substantiated and it is not what we've observed.

67 Markers

This resolution -- by nearly doubling the amount of data for the haplotype description -- adds yet more confidence to matches and helps to eliminate some coincidental matches.


Volatility of the added markers (#s 38-67) is unpublished. They appear to have, overall, about the same mutation rates as #s 1-25 and less than #s 26-37.

Marker #66 (DYS492) is worth special mention. It is highly predictive of whether a R-M269 man (70% of Taylors) is in the R-P312 subclade or R-U106. A value ≤12 indicates R-P312 and ≥13 indicates R-U106 with 95% confidence.
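
The DYS492 rule of thumb above can be written as a one-line check. This is only a minimal sketch: the function name is hypothetical, the rule applies only to men already known or predicted to be R-M269, and the result is a prediction at the roughly 95% confidence stated above, not a substitute for SNP testing.

```python
def predict_r_m269_branch(dys492: int) -> str:
    """Predict the major R-M269 branch from the DYS492 repeat count.

    Hypothetical helper: encodes only the rule of thumb in the text
    (<=12 suggests R-P312, >=13 suggests R-U106). It is meaningless
    for men outside R-M269.
    """
    return "R-P312" if dys492 <= 12 else "R-U106"


print(predict_r_m269_branch(12))  # R-P312
print(predict_r_m269_branch(13))  # R-U106
```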

About 50% of Taylor Family Genes members have tested to 67 markers.

111 Markers

Again, more markers → more confidence in matches. The 111-marker comparisons may also be useful in determining branches within genetic families. (We haven't seen enough tests at this level to be sure.)

About 15% of Taylor Family Genes members have tested to 111 markers.

The simple cases

Some calls are easy. The easy calls might even make up the majority of cases.

Resolution (# markers compared) is adequate to be confident in a conclusion; genetic distance is small enough to indicate a fairly recent CMA; the surnames match; haplogroups match. All is good.

For these simple cases, the admin needn't necessarily turn to more advanced techniques. Standard methods work and advice is bountiful on the Internet.

Not so simple

Some calls aren't so easy. Sometimes, it seems that exceptions to the rules outnumber the cases following the rules. It is with the exceptions that an admin "earns his salt".

Problems may be lurking under the surface of apparently simple cases too. We'll deal with some of those here.

Surnames don't match

Within our project, it seems unusual for the names on a match list to all agree with that of the subject project member. Usually, there will be at least one match with a different surname.

Predominant name

In this type of case, one name is much more frequent on the match list than others. Using a sort of "majority vote" criterion, we usually take this to mean that the predominant name is probably the ancestral one for the patriline and that the other names are the results of eNPE.

If the predominant name is the project surname, we conclude the others are the eNPE. If not, the subject's name probably results from iNPE.
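
The "majority vote" reasoning above might be sketched as follows. The function names and the 50% share threshold are assumptions made for illustration; the project does not state a numeric cut-off for "much more frequent", so adjust it to taste.

```python
from collections import Counter


def predominant_surname(match_surnames, min_share=0.5):
    """Return the surname holding more than min_share of the list, else None.

    min_share is a hypothetical threshold, not a project standard.
    """
    if not match_surnames:
        return None
    counts = Counter(name.strip().title() for name in match_surnames)
    name, n = counts.most_common(1)[0]
    return name if n / len(match_surnames) > min_share else None


def interpret(project_surname, match_surnames):
    """Apply the rule of thumb from the text to one member's match list."""
    top = predominant_surname(match_surnames)
    if top is None:
        return "no predominant name; draw no conclusion"
    if top == project_surname:
        return "other names are probably eNPE"
    return f"subject's name probably results from iNPE ({top} line)"


print(interpret("Taylor", ["Taylor", "Taylor", "Taylor", "Martin"]))
```

When no name clears the threshold the sketch deliberately returns no conclusion, which corresponds to the "many surnames, none dominant" case.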

Many surnames, none dominant

There may be dozens of surnames on the match list and no standouts. Due to the possibility of sampling bias, we can not use their frequencies to draw any conclusions.

This could arise from Scots or Irish clan relationships. Or, it could be that the patriline originates from a place where surnames were adopted late.

Non-project names

More and more we are seeing matches between project members, neither of whom has the project surname. The easiest resolution is to ignore it; slightly more helpful is informing the members.

How close, really, is a "close match"?

Family Tree DNA uses a series of genetic distance (GD) windows to report what it calls "close matches". Matches falling within the window are reported on match lists and termed close matches. Anything falling outside the windows is not reported. The windows are

This is not a perfect system, though it may be the best feasible one and is far better than not being able to search for any matches or getting a too-long list. In our looks at its efficacy, it seems to identify ~95% of the genealogically meaningful matches that exist in the database while failing to identify only ~5%.

That isn't bad performance for a rough-cut screening criterion. If it were a medical test it would be evaluated with this kind of matrix:

Actual Situation               Reported Close Match   Reported Not Close Match
Genealogically meaningful      True Positive (95%)    False Negative (5%)
Not genealogically meaningful  False Positive (?)     True Negative (?)
Totals                         (?)                    (?)
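
In screening-test terms, the 95% figure is the sensitivity of the match windows. A minimal sketch of the arithmetic follows; the counts are hypothetical and chosen only to reproduce the 95%/5% split. The second row of the matrix is unknown, so precision (the share of reported matches that are meaningful) cannot be computed from what we have.

```python
def sensitivity(true_pos, false_neg):
    """Share of genealogically meaningful matches that are reported."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Share of reported matches that are genealogically meaningful."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts matching the first row of the matrix.
tp, fn = 95, 5
print(sensitivity(tp, fn))  # 0.95

# The false positive count is a "?" in the matrix, so precision stays
# unknown until someone supplies that number.
```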

False positives

Not all the matches that appear on a match list are meaningful in the genealogic sense of a shared patriline. Some are merely coincidental, a function of the fact that each STR marker can take only a limited number of values.

False positives are seen in 5% to 10% of TFG members; we believe they are mostly at the outer edges of the FTDNA genetic distance thresholds. They may be more frequent in those with common haplotypes.

A suggested means of dealing with false positives is to apply more stringent criteria if the number of listed matches gets unusually large.

False negatives

Sometimes, not all the matches that are genealogically meaningful may appear on match lists. The match lists are built on genetic distance, a very crude measure of relatedness.

Up to 5% of genealogically meaningful matches may exceed the genetic distance criteria because more of their differences fall on volatile markers. Those within the project can be found using the Y-Genetic Distance feature of the GAP, followed by a TiP calculation:

  1. Log onto the GAP and navigate to Y-DNA Genetic distance under the Genetic Reports tab.
  2. Select the appropriate resolution.
  3. Select the subject member.
  4. The list will be ordered by steps of genetic distance from that member, closest to more distant.
  5. Click the TiP icon to the right of each other member's name. This will calculate mutation-rate adjusted TMRCA probabilities.

This procedure only works for project members. It will not find matches with non-members.
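
For admins who keep their own copy of project results, the ordering in step 4 can be approximated offline. The sketch below assumes a simple step-wise count (sum of absolute differences in repeat counts over markers both men tested). FTDNA's own calculation may treat some markers differently, so its numbers can differ, and nothing here replaces the TiP calculation. The kit labels and marker values are made up.

```python
def genetic_distance(hap_a, hap_b):
    """Step-wise genetic distance over the markers both men have tested."""
    shared = hap_a.keys() & hap_b.keys()
    return sum(abs(hap_a[m] - hap_b[m]) for m in shared)


def rank_by_distance(subject, members):
    """Return (kit, distance) pairs, closest first, excluding the subject."""
    subject_hap = members[subject]
    ranked = [
        (kit, genetic_distance(subject_hap, hap))
        for kit, hap in members.items()
        if kit != subject
    ]
    return sorted(ranked, key=lambda pair: pair[1])


# Made-up three-marker haplotypes, for illustration only.
members = {
    "A": {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    "B": {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    "C": {"DYS393": 12, "DYS390": 22, "DYS19": 15},
}
print(rank_by_distance("A", members))  # [('B', 1), ('C', 4)]
```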

Common haplotypes

This topic was the subject of a study conducted in Spring 2015. To oversimplify, men with more common haplotypes are more likely to have false positive matches. The more common haplotypes appear to be closer to modal haplotypes for the respective haplogroups.

Member A matches B & C, but B & C don't match each other

Member A is a "tweener"; his haplotype falls between B & C. Is he a previously missing link in a genetic family? Or, are there really two or three genetic families?

One project admin (with only a few and very large genetic families for his surname) simply compares everyone to the group modal haplotypes. If they sufficiently match the modal, they're in the group. If not, they're out.

In TFG, we'd assign A, B, & C to the same group because we can not assign A to more than one.
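
Grouping by chained matches, as TFG does here, amounts to finding connected sets of members. A minimal sketch using a union-find structure is below; the function name and the input shape (a list of kit labels plus a list of matching pairs) are assumptions for illustration.

```python
def group_by_chained_matches(kits, matches):
    """Put any two kits linked by a chain of matches into the same group."""
    parent = {k: k for k in kits}

    def find(k):
        # Follow parent links to the group's representative,
        # shortening the path as we go.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = {}
    for k in kits:
        groups.setdefault(find(k), []).append(k)
    return sorted(sorted(g) for g in groups.values())


# A matches B and C; B and C do not match each other; D matches no one.
print(group_by_chained_matches(["A", "B", "C", "D"], [("A", "B"), ("A", "C")]))
# [['A', 'B', 'C'], ['D']]
```

The trade-off is that a long enough chain can pull together members who are quite distant from one another.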

With small genetic families, a modal haplotype becomes problematic.
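
The modal-comparison approach, and the reason it breaks down for small groups, can be sketched like this. The function names are hypothetical and the marker values are made up. A marker with a tie for most common value is reported as None, meaning the group has no clear modal for it.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value per marker; None where the top values tie."""
    markers = set().union(*(h.keys() for h in haplotypes))
    modal = {}
    for m in sorted(markers):
        counts = Counter(h[m] for h in haplotypes if m in h).most_common()
        top_value, top_n = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_n
        modal[m] = None if tied else top_value
    return modal


def distance_to_modal(hap, modal):
    """Step-wise distance to the modal, skipping markers with no clear modal."""
    return sum(
        abs(hap[m] - v) for m, v in modal.items() if v is not None and m in hap
    )


three_members = [{"DYS390": 24}, {"DYS390": 24}, {"DYS390": 25}]
two_members = [{"DYS390": 24}, {"DYS390": 25}]
print(modal_haplotype(three_members))  # {'DYS390': 24}
print(modal_haplotype(two_members))    # {'DYS390': None}
```

With only two members, every marker on which they differ has no modal value at all.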

Haplogroup mismatch

A haplogroup conflict (e.g., R-P312 or a subclade vs. R-U106 or a subclade) invalidates any ySTR match. Despite what any TMRCA calculator may say, these men can not share a common direct paternal ancestor for thousands of years, on the order of 100+ generations.

These men should not be grouped into the same genetic family. However, the problem is often not apparent unless/until both (all?) have undergone sufficient SNP testing.
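
A haplogroup check can act as a veto before any ySTR match is accepted. The sketch below assumes a hand-maintained child-to-parent map; the map shown is deliberately partial and omits the intermediate SNPs between the named haplogroups. Two results are compatible when one is the same as, or upstream of, the other (for example, a man tested only to R-M269 and a man tested to R-L21).

```python
# Simplified, partial tree (child -> parent). Intermediate SNPs are omitted.
# Extend it before real use: a haplogroup missing from this map is treated
# as an unrelated root and will conflict with everything but itself.
PARENT = {
    "R-P312": "R-M269",
    "R-U106": "R-M269",
    "R-L21": "R-P312",
}


def lineage(haplogroup):
    """Return the haplogroup followed by its known upstream haplogroups."""
    chain = [haplogroup]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def haplogroups_compatible(hg_a, hg_b):
    """True when one haplogroup equals the other or lies upstream of it."""
    return hg_a in lineage(hg_b) or hg_b in lineage(hg_a)


print(haplogroups_compatible("R-M269", "R-L21"))  # True
print(haplogroups_compatible("R-L21", "R-U106"))  # False
```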

SNP testing is gaining in popularity and becoming more refined. A predictable consequence is that previous grouping decisions will turn out to have been erroneous.

Exception

With advances in ySNP testing, we are beginning to find "haplogroups" that originated within the genealogical time frame, mere centuries ago. These shouldn't be considered "mismatches" so much as markers defining branches within a genetic family.

Confidence

One may think of confidence in a purely statistical sense or simply as being sure of one's ground for conclusions. In any event, confidence is a relative thing.

For example, I have no confidence in (most) 12-marker matches (or non-matches) and very limited confidence in 25-marker matches. At 37 markers, I begin to have confidence that a match reflects a reality of shared patrilines and, to a limited extent, the recency of the CMA.

Past grouping errors

Our understanding of DNA has advanced rapidly since genetic genealogy testing became common ca 2003. Some decisions made then would not be made in the same way today, with more sophisticated knowledge. There are two basic ways of handling past errors:

Documentation and genetics conflict

Sometimes, a member's documentation contradicts what the genetic data show. The usual advice is "Documentation trumps genetics"; it is based on genetic data usually requiring a probabilistic interpretation and documentation standing on its own merits.

I take a somewhat contrary view -- that genetic data is objective and so-called "documentation" often reflects subjective biases. I've seen too many instances where documentation was misinterpreted, misapplied or just made up. I'll take a haplogroup mismatch over a great-aunt Ellen's story every day of the week.

Each admin is free to establish his/her own standards but I would not assign a person to a Taylor genetic family based on a genealogy. The DNA must match.

Problem Groups

Because we've learned more about yDNA and because more members yield a better picture, we can now identify problems with judgments and decisions made in the past. There are groupings which don't really make up a single genetic family (patriline). We now suspect each of these may be more than one family.

Legacy Question

It is problematic to undo past wrongs by questioning previous grouping decisions. To do so, one needs a solid justification and a better grouping. While this is theoretically attainable, it isn't always practically available.

R1b-006

This group (of currently 17 members) was made by overlapping matches: A matched B, who matched C, but C didn't match A. It shows a wide spread in genetic distance from the modal member (#258447) to the furthest from the modal -- 7 @37 markers, 15 @67, 17 @111. Only six of these have done any SNP testing, all R-L21 or downstream of it; all are within R-M269.

Sorted by closeness to modal:

Kit     Name     12m  37m  67m  111m  Hgrp
258447  Taylor   0    0    0    1     FGC18451
189171  Martin   0    1    1    1     L1065
66126   Taylor   0    1    1    2     FGC18451
474655  Taylor   0    1    1    N/A   M269
197635  Taylor   0    1    N/A  N/A   M269
51667   Taylor   1    2    2    N/A   M269
N82248  Taylor   0    3    3    3     FGC18451
123690  Binns    0    1    4    7     S7364
483160  Dickson  3    5    N/A  N/A   M269
N1964   Taylor   2    6    9    N/A   M269
127270  Brown    2    6    9    17    M269
180739  Taylor   3    6    N/A  N/A   M269
180737  Taylor   3    7    N/A  N/A   M269
295160  Taylor   3    7    N/A  N/A   M269
26356   Taylor   2    7    N/A  N/A   M269
B1109   Roberts  2    7    15   N/A   L21

I now think this may be -- not a single patriline -- but two (or more?). How to divide is the question. A TiP matrix may help. It is also a candidate for more ySNP testing by all members of it.

R1b-022

R1b-022, with 15 members, also consists of overlapping matches; it seems worse. (Working on data presentation.)