Grouping is perhaps the most important task of the TFG admin team and can be the
most trying. This page is written particularly to assist successors to the
current TFG admin & co-admins but may be of help to other surname project admins. The
emphasis here is not on specific technical details, but an approach to
dealing with those details.
What is grouping?
Grouping, known to FTDNA as "subgrouping", is the process of
determining that two or more project members belong to the same (Taylor)
genetic family or patriline, assigning them to it and all the associated
that a match meeting project standards exists between two or more
project members. This is most often done upon new results being posted and
evaluated by the admin.
- At least one of these members must have the Taylor surname;
we do not determine genetic families for other than the project
- Adding the new match to the genetic family by means of the "Member
sub-grouping" feature of the GAP.
- If a newly found group, this requires adding the group to the list of possible choices
and setting its formatting.
the new match to the genetic family's page on the project website.
- If a newly found group, this requires creating the page and updating
the index of groups.
- For an example of such a page, see
on the Taylor website. Other projects may have different ways of
doing the same thing.
- Informing group members, new and prior, of the actions taken. (See
- Re-analyzing group characteristics in light of the new data.
- These include modal haplotype (rarely changes), cladograms, etc.
If a project member does not significantly match at least one other Taylor,
he is placed in a sub-group for his major haplogroup. There are 20 of these
"waiting for match" sub-groups:
- One for each macro-haplogroup other than R1b, i.e., A, E*, E1b, G,
I*, I1, I2, J1, J2, N, O, Q & R1a
- Within R1b, one each for "Niall of the Nine Hostages" haplotype,
WAMH haplotype, R-M269 P312 or U106 not determinable, R-U106, R-P312, 12
markers only, and non-Taylor paternity.
- Occasionally, an "Ungrouped" sub-group may appear in the Y-results
table; it contains members for whom a grouping decision is pending.
Similarity between haplotypes is the primary key to grouping decisions.
The general principle is that more similar haplotypes are more likely to
share a common direct paternal ancestor within more recent time than
There are exceptions and qualifications to this principle. They will be
An admin is well-advised to keep current with the grouping task. If left unattended
for too long, it can grow to become an overwhelming mess.
If a large grouping backlog does occur, one way to handle it is:
- Stop the pile from getting bigger. Handle current issues as they occur.
- Make a plan to eliminate the backlog. Estimate the problem's
size and the resources needed. Include, if relevant, priorities for parts
of the backlog.
- Execute the plan faithfully. Devote some time every day until the
pile has disappeared.
To understand the rest of the discussion, let's define some terms. (Also
see the Glossary.)
- Resolution -- means how many markers are available for
comparison: 12, 25, 37, 67 or 111. In general, higher resolutions are
better than lower.
- Match list -- means the list of "close matches" shown by
FTDNA for a member at a specific resolution on a specific day. Match
lists may range from thousands of close matches to a few, or even none.
- Haplotype -- means the quantitative description of ySTR
allele values for each marker, the entire string of markers and their
- Mode, modal -- The mode is the most frequent value of a
group, sample or population; it, along with the mean (average) and median, is a
measure of central tendency and the most useful for ySTR because it is
not affected by outliers. Modal is the
adjective form, commonly applied to the entire string of marker/allele
- Window, threshold -- refer to the genetic distance
criteria FTDNA uses for reporting "close
matches" on a match list.
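As a rough illustration of the "modal" term defined above, the per-marker mode of a group can be computed as in the sketch below. The function name and the toy 4-marker haplotypes are invented for illustration; real haplotypes have 12 to 111 markers, and multi-copy markers may need special handling that this sketch ignores.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the most frequent allele value at each marker position.

    `haplotypes` is a list of equal-length lists of allele values, all in
    the same marker order. Ties go to the value encountered first.
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("all haplotypes must cover the same markers")
    # zip(*haplotypes) walks the group one marker (column) at a time.
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]


# Hypothetical group of three members (values invented for illustration).
group = [
    [13, 24, 14, 11],
    [13, 24, 15, 11],
    [13, 25, 14, 11],
]
print(modal_haplotype(group))  # [13, 24, 14, 11]
```

Note that no single member needs to carry the modal haplotype exactly; here the first member happens to.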
Resolution means how many markers have been tested and can be compared.
Resolution may be compared to a physical description of an individual; the
more data as to the person's appearance, the easier it is to identify that
There are 495 Y-STR markers, of which only 111 can currently be tested and
compared through FTDNA. (Some other companies report more markers,
but those data aren't searchable in the FTDNA database.)
About 90-95% of 12-marker matches (including exact matches) do not qualify
as genealogically significant. This resolution level is roughly comparable
to "All cats look alike in the dark"; it does not provide enough genetic
information to distinguish one patriline from another. In our experience,
12-marker matches tend to fall apart when/if higher resolution comparisons are
The 5-10% exceptions are instances in which rare marker values appear in the
first 12 tested. Such values may be considered analogous to a unique
birthmark or tattoo. But, unlike birthmarks, Y-STR values aren't known until
tested; we therefore don't recommend the 12-marker test.
The 12-marker panel was developed from genetic anthropology & population
genetics applications for distinguishing
populations. It's highly suitable for that task but less so for distinguishing genetic families.
About 15% of Taylor Family Genes members have tested no higher than 12 markers.
Each has been contacted multiple times, recommending an upgrade; it is
now unlikely that more than a few of these "legacies" will upgrade.
About 50% of 25-marker matches are not genealogically significant; they
suffer many of the same problems as their little brothers. We recommend
advising project members that 25-marker matches carry low confidence and
evaluating at higher resolutions when possible.
The 25-marker test was introduced as an improvement, which it was.
However, it proved only marginally better than 12 markers and was soon de-emphasized.
About 5% of Taylor Family Genes members have tested to 25
markers but no higher.
This is the lowest resolution at which we have reasonable confidence in genealogical
significance; that a close match indicates a common ancestor within
The added markers (#s 26-37) are the most volatile tested.
CDY, with its two copies, is especially volatile (~1:28
generations) and is sometimes discounted as a mismatching marker.
About 85% of Taylor Family Genes members have tested to 37
markers. (This seems to be a much higher percentage than for the entire
There is some debate among experts as to the confidence one
should place in 37 ySTR markers; an "error rate" of up to 80% has been
claimed. However, we think this hasn't been adequately substantiated and
it is not what we've observed.
This resolution -- by nearly doubling the amount of data for the
haplotype description -- adds yet more confidence to matches and helps to
eliminate some coincidental matches.
Volatility of the added markers (#s 38-67) is unpublished. They appear
to have, overall, about the same mutation rates as #s 1-25 and less than
Marker #66 (DYS492) is worth special mention. It is highly predictive of
whether an R-M269 man (70% of Taylors) is in the R-P312 subclade or R-U106. A
value ≤12 indicates R-P312 and ≥13 indicates R-U106 with 95% confidence.
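The DYS492 rule of thumb can be written down as a one-line check. This is only a sketch of the rule as stated above (the function name is invented); with roughly 95% confidence it is a hint for provisional sub-grouping, not a substitute for SNP testing.

```python
def predict_r_m269_branch(dys492):
    """Guess the major R-M269 branch from the DYS492 (marker #66) value.

    Applies the rule of thumb from the text: <=12 suggests R-P312,
    >=13 suggests R-U106. Only meaningful for men already known or
    predicted to be R-M269.
    """
    return "R-P312" if dys492 <= 12 else "R-U106"


print(predict_r_m269_branch(12))  # R-P312
print(predict_r_m269_branch(13))  # R-U106
```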
About 50% of Taylor Family Genes members have tested to 67 markers.
Again, more markers → more confidence in matches. The 111-marker test may
also be useful in determining branches within genetic families. (We haven't
seen enough tests at this level to be sure.)
About 15% of Taylor Family Genes members have tested to 111 markers.
The simple cases
Some calls are easy. The easy calls might even make up the majority of
Resolution (# markers compared) is adequate to be
confident in a conclusion; genetic distance is small enough to indicate a
fairly recent CMA; the surnames match; haplogroups match. All is good.
For these simple cases, the admin needn't necessarily turn to more
advanced techniques. Standard methods work, and advice is plentiful on the Internet.
Not so simple
Some calls aren't so easy. Sometimes, it seems that exceptions to the
rules outnumber the cases following the rules. It is with the exceptions
that an admin "earns his salt".
Problems may be lurking under the surface of apparently simple cases.
We'll deal with some of those here.
Surnames don't match
Within our project, it seems unusual for the names on a match list to all
agree with that of the subject project member. Usually, there will be at least
one match with a different surname.
In this type of case, one name is much more frequent on the match list than others.
Using a sort of "majority vote" criterion, we usually take this to mean that the predominant name is probably the
ancestral one for the patriline and that the other names are the results of
If the predominant name is the project surname, we conclude the others are
If not, the subject's name probably results
Many surnames, none dominant
There may be dozens of surnames on the match list and no standouts. Due
to the possibility of sampling bias, we cannot use their frequencies to
draw any conclusions.
This could arise from Scots or Irish clan relationships. Or, it could be that
the patriline originates from a place where surnames were adopted late.
More and more, we are seeing matches between project members, neither of
whom has the project surname. The easiest resolution is to ignore such a match;
slightly more helpful is informing the members.
How close, really, is a "close match"?
Family Tree DNA uses a series of
genetic distance (GD) windows to report what
it calls "close matches". Those matches falling within the window are reported
on match lists and termed close matches. Anything falling outside the
windows is not reported. The windows are:
- 12 markers -- GD = 0 steps for non-project members; GD ≤ 1
between project members.
- 25 markers -- GD ≤ 2 steps
- 37 markers -- GD ≤ 4 steps
- 67 markers -- GD ≤ 7 steps
- 111 markers -- GD ≤ 10 steps
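The windows above translate directly into a lookup table. The sketch below assumes a simplified stepwise genetic distance (the sum of absolute allele differences); FTDNA's own calculation may treat some markers differently, so results near the edge of a window may not agree with the match lists. Function names and the sample 12-marker haplotypes are invented for illustration.

```python
# Resolution -> maximum genetic distance reported as a "close match".
# The 12-marker entry is the window between project members.
WINDOWS = {12: 1, 25: 2, 37: 4, 67: 7, 111: 10}


def genetic_distance(a, b):
    """Simplified stepwise distance: sum of absolute allele differences."""
    if len(a) != len(b):
        raise ValueError("haplotypes must be at the same resolution")
    return sum(abs(x - y) for x, y in zip(a, b))


def is_close_match(a, b, both_in_project=True):
    """Check two haplotypes against the window for their resolution."""
    distance = genetic_distance(a, b)
    resolution = len(a)
    if resolution not in WINDOWS:
        raise ValueError(f"unsupported resolution: {resolution}")
    limit = WINDOWS[resolution]
    if resolution == 12 and not both_in_project:
        limit = 0  # outside the project, only exact 12-marker matches are listed
    return distance <= limit


# Invented 12-marker haplotypes differing by one step on the last marker.
a = [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 29]
b = [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 30]
print(genetic_distance(a, b))                        # 1
print(is_close_match(a, b))                          # True
print(is_close_match(a, b, both_in_project=False))   # False
```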
This is not a perfect system, though it may be the best feasible one and
is far better than not being able to search for any matches or getting a
too-long list. In our reviews of
its efficacy, it seems to identify ~95% of the genealogically meaningful
matches that exist in the database, missing only ~5%.
That isn't bad performance for a rough-cut screening criterion. If it were a medical test, it would
be evaluated with this kind of matrix:
|                              ||Close Match   ||Not Close Match
|Genealogically meaningful     ||True positive ||False negative
|Not genealogically meaningful ||False positive||True negative
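To make the medical-test framing concrete, the usual screening rates can be computed from the four cell counts of such a matrix. The counts below are invented purely to mirror the rough ~95% / ~5% split mentioned above; they are not project data.

```python
def screening_rates(tp, fp, fn, tn):
    """Return standard screening-test rates from confusion-matrix counts.

    tp: meaningful matches that appear on the match list
    fn: meaningful matches that fall outside the window
    fp: listed matches that are not genealogically meaningful
    tn: non-meaningful pairs correctly left off the list
    """
    return {
        "sensitivity": tp / (tp + fn),          # share of real matches found
        "false_negative_rate": fn / (tp + fn),  # share of real matches missed
        "specificity": tn / (tn + fp),          # share of non-matches excluded
    }


# Hypothetical counts, for illustration only.
rates = screening_rates(tp=95, fp=10, fn=5, tn=890)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```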
Not all the matches that appear on a match list are meaningful in the
genealogical sense of a shared patriline. Some are merely coincidental, a function of the fact that
each STR marker can take only a limited number of values.
False positives are seen in 5% to 10% of TFG members; we believe they are
mostly at the outer edges of the FTDNA genetic distance thresholds. They may be
more frequent in those with common haplotypes.
A suggested means of dealing with false positives is to apply more stringent
criteria if the number of listed matches gets unusually large.
Sometimes, not all the matches that are genealogically meaningful may
appear on match lists. The match lists are built on genetic distance, a very
crude measure of relatedness.
Up to 5% of genealogically meaningful matches may exceed the genetic
distance criteria because of extra steps on volatile markers. Those within the
project can be found using the Y-DNA Genetic Distance feature of the GAP,
followed by a TiP calculation:
- Log onto the GAP and navigate to Y-DNA Genetic distance under the
Genetic Reports tab.
- Select the appropriate resolution.
- Select the subject member.
- The list will be ordered by steps of genetic distance from that member,
closest to more distant.
- Click the TiP icon to the right of each other member's name. This will
calculate mutation-rate adjusted TMRCA probabilities.
This procedure only works for project members. It will not find matches with non-members.
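For admins who keep project haplotypes in a spreadsheet or file, the ordering step can be approximated offline. The sketch below is an assumption-laden illustration: it uses a simple stepwise distance, invented kit labels and toy 5-marker haplotypes, and a hypothetical `margin` parameter for "just outside the window". It does not reproduce the TiP calculation, which still has to be run in the GAP.

```python
def rank_by_distance(subject_kit, haplotypes):
    """List (distance, kit) for every other member, closest first.

    `haplotypes` maps kit label -> list of allele values, all at the same
    resolution and in the same marker order.
    """
    subject = haplotypes[subject_kit]
    rows = []
    for kit, hap in haplotypes.items():
        if kit == subject_kit:
            continue
        distance = sum(abs(x - y) for x, y in zip(subject, hap))
        rows.append((distance, kit))
    return sorted(rows)


def just_outside_window(ranked, window, margin=2):
    """Kits whose distance exceeds the window by no more than `margin`."""
    return [kit for distance, kit in ranked if window < distance <= window + margin]


# Toy data (invented): four members at a pretend 5-marker resolution.
haplotypes = {
    "K1": [13, 24, 14, 11, 11],
    "K2": [13, 24, 14, 11, 12],
    "K3": [13, 25, 15, 11, 13],
    "K4": [14, 26, 16, 12, 13],
}
ranked = rank_by_distance("K1", haplotypes)
print(ranked)                                  # [(1, 'K2'), (4, 'K3'), (8, 'K4')]
print(just_outside_window(ranked, window=2))   # ['K3']
```

Members flagged this way are only candidates; the TiP result and any haplogroup information should decide whether they belong in the group.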
This topic was the subject of a
study conducted in Spring 2015. To oversimplify, men with more common
haplotypes are more likely to have false positive matches. The more common
haplotypes appear to be closer to modal haplotypes for the respective
Member A matches B & C, but B & C don't match each other
Member A is a "tweener"; his haplotype falls between B & C. Is he a
previously missing link in a genetic family? Or, are there really two or
three genetic families?
One project admin (with only a few and very large genetic
families for his surname) simply compares everyone to the group modal
haplotypes. If they sufficiently match the modal, they're in the group. If not, they're out.
In TFG, we'd assign A, B, & C to the same group because we cannot
assign A to more than one.
With small genetic families, a modal haplotype becomes problematic.
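The TFG approach to overlapping matches amounts to treating each match as a link and putting everyone who is connected, directly or through a chain, into one group. A minimal sketch of that idea using a union-find structure follows; the function name, member labels, and match pairs are invented for illustration, and this is not a description of any FTDNA or GAP feature.

```python
def group_by_overlap(members, matches):
    """Group members so that any chain of matches ends up in one group.

    `matches` is an iterable of (member, member) pairs that meet the
    project's matching standard. Returns a sorted list of sorted groups.
    """
    parent = {m: m for m in members}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in members:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())


# A matches B and C; B and C do not match each other; D matches no one.
members = ["A", "B", "C", "D"]
matches = [("A", "B"), ("A", "C")]
print(group_by_overlap(members, matches))  # [['A', 'B', 'C'], ['D']]
```

The drawback is visible in the structure itself: a long chain of overlapping matches can pull quite distant haplotypes into one group, so large chained groups deserve a second look.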
A haplogroup conflict (e.g., R-P312 or a subclade vs. R-U106 or a
subclade) invalidates any ySTR match. Despite what any TMRCA calculator may
say, these men cannot share a common direct paternal ancestor within thousands
of years, on the order of 100+ generations.
These men should not be grouped into the same genetic family. However, the
problem is often not apparent unless/until both (all?) have undergone sufficient
SNP testing is gaining in popularity and becoming more refined. A predictable
consequence is that previous grouping decisions will turn out to have been
With advances in ySNP testing, we are beginning to find "haplogroups" that
originated within the genealogical time frame, mere centuries ago. These
shouldn't be considered "mismatches" as much as defining branches within a
One may think of confidence in a purely statistical sense or simply as
being sure of one's ground for conclusions. In any event, confidence is a
For example, I have no confidence in (most) 12-marker matches (or
non-matches) and very limited confidence in 25-marker matches. At 37-markers, I
begin to have confidence that a match reflects a reality of shared patrilines
and, to a limited extent, the recency of the CMA.
Past grouping errors
Our understanding of DNA has advanced rapidly since genetic genealogy
testing became common ca. 2003. Some decisions made then would not
be made in the same way today, with more sophisticated knowledge. There are two
basic ways of handling past errors:
Let sleeping dogs lie
There will be instances (e.g., inadequate resolution) in which the
grouping precedent should be respected. It can be argued that overturning a
decision calls for stronger & clearer evidence than went into the original
decision. If the evidence is unlikely to exist or is ambiguous, it may be insufficient
to justify reversal.
There may also be instances where the project's credibility would be
called into question by not reversing a prior decision. These should be
corrected and appropriate members informed of the new decision and the
reasons for it.
Documentation and genetics conflict
Sometimes, a member's documentation contradicts what the genetic data show.
The usual advice is "Documentation trumps genetics"; it is based on genetic
data usually requiring a probabilistic interpretation and documentation
standing on its own merits.
I take a somewhat contrary view -- that genetic data is objective and
so-called "documentation" often reflects subjective biases. I've seen too many
instances where documentation was misinterpreted, misapplied or just made up.
I'll take a haplogroup mismatch over a great-aunt Ellen's story every day of the
Each admin is free to establish his/her own standards but I would not assign
a person to a Taylor genetic family based on a genealogy. The DNA must match.
Because we've learned more about yDNA and because more members yield a
better picture, we can now identify problems with judgments and decisions
made in the past. There are groupings which don't really make up a single
genetic family (patriline). We now suspect each of these may be more than
It is problematic to undo past wrongs by questioning previous grouping
decisions. To do so, one needs a solid justification and a better grouping.
While this is theoretically attainable, it isn't always practically
This group (of currently 17 members) was made by overlapping
matches: A matched B, who matched C, but C didn't match A. It shows a wide
spread in genetic distance from the modal member (#258447) to the furthest
from the modal -- 7 @37 markers, 15 @67, 17 @111. Only six of these have done any
SNP testing, all R-L21 or downstream of it; all are within R-M269.
Sorted by closeness to modal:
|Kit||Name|| 12m || 37m || 67m || 111m||Hgrp
I now think this may be -- not a single patriline -- but two (or more?).
How to divide it is the question. A TiP matrix may help. The group is also a
candidate for more ySNP testing by all its members.
R1b-022, with 15 members, also consists of overlapping matches; it seems
worse. (Working on data presentation.)