Measuring Haplotype Rarity
Most men’s haplotypes consist primarily of modal values for their haplogroup. Only a few of their STR markers will display unusual values. Measurement of commonness vs. rarity depends on highlighting the less-common values.
Among the problems in measuring commonness vs. rarity is haplotype diversity.
- For example, of 494
in the Taylor Family Genes project, 449 haplotypes are unique and only 31 are shared by
more than one participant. Of 304 67-marker results, 298 haplotypes are unique and only
6 are shared by more than one participant.
- It makes mathematical sense. If each marker could take any of four values (some may take more than four, some fewer), 37 markers could produce 4^37 (about 1.9 × 10^22) possible haplotypes and 67 markers 4^67 (about 2.2 × 10^40). With five possible values, the numbers are about 7.3 × 10^25 and 6.8 × 10^46 respectively.
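The count of possible haplotypes follows from the multiplication rule: if each of n markers independently takes any of k values, there are k^n combinations. A minimal Python sketch, assuming independence between markers and the same number of values for every marker (both are simplifications; the function name `possible_haplotypes` is ours):

```python
def possible_haplotypes(markers: int, values_per_marker: int) -> int:
    """Count haplotypes if every marker independently takes the same number of values."""
    return values_per_marker ** markers


for values in (4, 5):
    for markers in (37, 67):
        count = possible_haplotypes(markers, values)
        print(f"{markers} markers, {values} values each: about {count:.1e}")
```

Real markers differ in how many values they take and do not mutate at the same rate, so these figures are only a rough upper bound on the space of haplotypes, not a prediction of how many will be observed.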
There is also diversity among the STR markers tested; they vary in how tightly observed values concentrate around their
modes. Some markers have “tight distributions” with small variances and some have “loose distributions” with relatively large variances. For example, in haplogroup
Figure 1: Frequency distribution YCAIIb
Figure 2: Frequency distribution DYS449
Figure 3: Frequency distribution DYS390
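One way to put a number on "tight" versus "loose" is to compare each marker's mode, the share of results at the mode, and the variance of observed repeat counts. The sketch below is illustrative only: the repeat counts are invented, and `MARKER_A` and `MARKER_B` are hypothetical names, not markers from any project.

```python
from collections import Counter
from statistics import pvariance

# Invented repeat counts for illustration only; not real project data.
observed = {
    "MARKER_A": [24, 24, 24, 24, 23, 24, 25, 24],  # values cluster at the mode
    "MARKER_B": [29, 31, 30, 32, 28, 30, 33, 30],  # values spread widely
}


def summarize(values):
    """Return (mode, share of observations at the mode, population variance)."""
    mode, hits = Counter(values).most_common(1)[0]
    return mode, hits / len(values), pvariance(values)


for marker, values in observed.items():
    mode, share, variance = summarize(values)
    print(f"{marker}: mode={mode}, at mode={share:.0%}, variance={variance:.2f}")
```

A marker with a small variance and a large share at the mode behaves like the "tight" case; an off-modal value there is more unusual than the same size of deviation on a "loose" marker.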
A third measurement problem concerns haplogroup and subclade diversity. Frequency distributions (and modal values) vary from one haplogroup or subclade to another. A measurement scale for R-U106 is not necessarily appropriate for R-P312.
- Genetic genealogists rely on diversity to discriminate between paternal genetic families, but that same diversity makes it difficult to generalize about the nature of haplotypes.
We also have a definitions problem: How alike must haplotypes be to qualify as “similar”? Exactly matching? One marker differing? Two? Without a definition of similarity, we cannot compare matches to commonness or rarity. When required for this article, we take the “close match” genetic-step reporting windows of FTDNA as our definition of similarity. That is:
- At 12 markers, 0 steps (exact match)
- At 25 markers, 2 steps
- At 37 markers, 4 steps
- At 67 markers, 7 steps
- At 111 markers, 10 steps
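The windows above can be expressed as a lookup table plus a distance function. This is a sketch under stated assumptions, not FTDNA's method: we assume genetic distance is the summed absolute difference in repeat counts across markers, whereas the testing company's own calculation may treat some markers differently. The names `CLOSE_MATCH_STEPS`, `genetic_distance`, and `is_close_match` are ours, and the haplotype values are invented.

```python
# Close-match windows listed above: markers tested -> maximum genetic steps.
CLOSE_MATCH_STEPS = {12: 0, 25: 2, 37: 4, 67: 7, 111: 10}


def genetic_distance(a, b):
    """Summed absolute repeat-count difference across markers (a simplification)."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def is_close_match(a, b):
    """True if two haplotypes fall within the window for their marker count."""
    limit = CLOSE_MATCH_STEPS.get(len(a))
    if limit is None:
        raise ValueError(f"no window defined for {len(a)} markers")
    return genetic_distance(a, b) <= limit


# Invented 12-marker haplotypes for illustration only.
reference = [14, 23, 15, 10, 12, 14, 11, 13, 12, 13, 14, 30]
one_step_off = reference[:-1] + [31]

print(is_close_match(reference, reference))     # exact match at 12 markers
print(is_close_match(reference, one_step_off))  # one step exceeds the 0-step window
```

Keeping the thresholds in one table makes it easy to see that "similar" means something different at each testing level.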
There is, too, a dimensional problem. Each marker is free to vary independently (up or down) and thus represents a separate dimension. A haplotype may be viewed as having as many dimensions as markers tested. Attempting to reduce this space to a one-dimensional scale ignores this complexity. A thing’s shadow is not the thing.
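A small sketch of what the reduction loses: two haplotypes can receive the same score on a one-dimensional scale while differing from each other. The four-marker values are invented, and the summed absolute difference from a modal haplotype is only one possible scale, chosen here as an assumption.

```python
def steps(a, b):
    """Summed absolute difference in repeat counts (one possible scalar scale)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Invented four-marker haplotypes for illustration only.
modal = (13, 24, 14, 11)
first = (15, 24, 14, 11)   # two steps up on the first marker
second = (13, 23, 14, 12)  # one step down on the second, one up on the fourth

print(steps(modal, first), steps(modal, second))  # same score on the scale
print(steps(first, second))                       # yet the two differ from each other
```

Both haplotypes sit the same distance from the modal values, but the scale cannot say which markers moved or in which direction.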
Therefore, regard this study as abstracting just one aspect of Y-chromosomal
DNA - the extent to which haplotypes resemble others (commonness) or are
Biases surely exist -- both in the reference data we use as a standard and in
the sample data we compare to the reference data. Sources of bias include,
but are not limited to,
- Self-selection -- Testing is voluntary and, in most instances, paid by
- Laws and regulations -- Some governments limit testing, restricting the
extent to which their citizens are represented.
- Over- and under-sampling of patrilines -- Certain subjects may be
recruited to test; others may feel they don't need the information.
- Adoptees and descendants of NPE may be more likely to test.
- Absence of sampling control methods -- Outside academic environments,
procedures to assure random or representative samples are virtually
We try, in this study, to balance biases by broadening the comparison
sample but cannot fully correct for inherent biases. We can merely recognize
that conclusions drawn are tentative and subject to revision.
Last but not least, there are issues of perspective and interest. A project administrator can fully examine only the results of his or her own project
and cannot readily analyze (nor may care to analyze) other projects’ results.
To perceive large patterns, we cannot retain blinders but must take a broad