Measuring Haplotype Rarity

Data Set & Collection

For this article, concepts were developed and initially tested with data solely from the Taylor Family Genes project. Further analysis was then carried out with data from eight large DNA projects for high-frequency surnames. The data set comprised published[23] Y-STR values for these eight surname DNA projects: Smith, Johnson/Johnston, Williams, Brown, Jones, Miller, Anderson, and Taylor[24]. See Appendix A for more information.

The eight projects were chosen for these characteristics:

From 7,668 participant haplotypes collected in all haplogroups, the R1b haplogroup was subjected to further analysis by number of markers tested. This R1b data set consists of

Table 1
R1b Sample Sizes
Markers   12      25      37      67
n         4,940   4,203   3,824   1,953
n%        100%    85.1%   77.4%   39.5%

The number of participants decreases as the number of markers increases, which reflects actual levels of testing; fewer than 40% of participants in these projects tested to 67 markers. As will be seen below, their haplotypes ranged widely from common to rare.
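The percentages in Table 1 follow directly from the participant counts. The minimal Python sketch below reproduces them; the variable and function names are ours, and the marker-set labels (12, 25, 37, 67) are an assumption inferred from the marker sets named in the text.

```python
# Participant counts from Table 1, keyed by nominal marker-set size.
# The keys (12, 25, 37, 67) are assumed from the marker sets named in the text.
r1b_counts = {12: 4940, 25: 4203, 37: 3824, 67: 1953}

def share_of_base(counts, base_key=12):
    """Return each count as a percentage of the base marker set, rounded to 0.1."""
    base = counts[base_key]
    return {size: round(100 * n / base, 1) for size, n in counts.items()}

shares = share_of_base(r1b_counts)
for size, pct in shares.items():
    print(f"{size:>3} markers: {r1b_counts[size]:>5,} participants ({pct}%)")
```

Each percentage is taken relative to the smallest marker set, since every participant in the R1b data set tested at least that far.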

Collection procedure

Data was collected in late May and early June 2015 by copying and pasting Y-results tables into Excel spreadsheets. Unique identification numbers were then assigned. Names, kit numbers, paternal ancestor and country of origin information were removed to protect individual privacy before the STR data was pasted into another Excel spreadsheet combining all projects for calculations.
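The same privacy step could be scripted instead of done by hand in a spreadsheet. The following is only a sketch of that idea, not the procedure actually used: the column names, the ID format and the sample rows are all invented for illustration.

```python
import csv
import io

# Hypothetical column names; real Y-results tables may label these differently.
IDENTIFYING_COLUMNS = {"Name", "Kit Number", "Paternal Ancestor", "Country"}

def anonymize(rows, project, start=1):
    """Drop identifying columns and assign a sequential ID to each participant."""
    cleaned = []
    for offset, row in enumerate(rows):
        kept = {k: v for k, v in row.items() if k not in IDENTIFYING_COLUMNS}
        cleaned.append({"ID": f"{project}-{start + offset:04d}", **kept})
    return cleaned

# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "Kit Number,Name,Paternal Ancestor,Country,DYS393,DYS390\n"
    "12345,Doe,John Doe b. 1750,England,13,24\n"
    "67890,Roe,Richard Roe b. 1801,Scotland,13,23\n"
)
anonymized = anonymize(list(csv.DictReader(sample)), project="P1")
print(anonymized)
```

Only the new ID and the STR values survive, so the combined spreadsheet carries nothing that identifies a participant.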

Marker Set, Nominal Size:

The terms “marker set” and “nominal size” are used for convenience; we shall define them. Four marker sets are discussed; their nominal sizes and make-up consist of

Because of possible extra marker copies and some variability in the markers tested, the nominal size of a marker set is not necessarily the number of markers tested and scored. A nominal set of 12 markers may contain as many as 14, and a nominal set of 25 as many as 30. On the other hand, a nominal set of 37 could contain as few as 27 if only some of markers 26-37 are known.

The 111-marker set is not discussed due to absence of frequency distribution data for these markers.

“Extra copies”:

Additional copies (beyond the usual) of DYS19, DYS385 & DYS464 are sufficiently abnormal that their existence, more so than their values, is indicative of haplotype rarity. Due to reporting conventions, the value of an extra copy can never be less than that of the highest usual copy. Therefore, we limit the determination of extra copies’ rarity to simple existence, rather than calculating it from allele frequency distributions.

For example, one instance of DYS19b was found in the R1b data set, two of DYS385c and 28 of DYS464e.
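Testing for existence rather than value can be sketched in a few lines of Python. The usual copy counts below are inferred from the extra copies named above (DYS19b, DYS385c, DYS464e); the data layout, function name and example values are our own assumptions.

```python
# Usual copy counts implied by the text: DYS19b, DYS385c and DYS464e are the
# first "extra" copies, so the usual counts are taken to be 1, 2 and 4.
USUAL_COPIES = {"DYS19": 1, "DYS385": 2, "DYS464": 4}

def extra_copy_markers(haplotype):
    """Return the markers that report more copies than usual.

    `haplotype` maps a marker name to a list of reported copy values
    (a hypothetical layout chosen for this sketch).
    """
    return [
        marker
        for marker, usual in USUAL_COPIES.items()
        if len(haplotype.get(marker, [])) > usual
    ]

# Made-up example values, for illustration only.
example = {"DYS19": [14], "DYS385": [11, 14], "DYS464": [15, 15, 17, 17, 17]}
flagged = extra_copy_markers(example)
print(flagged)
```

The check looks only at how many copies are reported, never at the values of the extra copies, mirroring the decision to score existence alone.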


Calculations were carried out in computer spreadsheets (MS Excel©) using index formulas. For each marker in each haplotype, these formulas looked up the frequency of the allele’s value and compared it to the most common value for the marker. (Specifics of the comparison differ with the type of metric investigated.)
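As a rough illustration of that lookup step, the Python sketch below builds a frequency table for one marker and returns a haplotype's allele frequency alongside the frequency of the most common (modal) allele. It is not the spreadsheet formula itself: the function names, the data layout and the sample values are assumptions, and how the two frequencies are then combined depends on the metric.

```python
from collections import Counter

def allele_frequencies(haplotypes, marker):
    """Relative frequency of each allele value observed at one marker."""
    values = [h[marker] for h in haplotypes if marker in h]
    counts = Counter(values)
    total = len(values)
    return {allele: n / total for allele, n in counts.items()}

def compare_to_modal(haplotype, haplotypes, marker):
    """Return (frequency of this haplotype's allele, frequency of the modal allele)."""
    freqs = allele_frequencies(haplotypes, marker)
    own = freqs.get(haplotype[marker], 0.0)
    modal = max(freqs.values())
    return own, modal

# Made-up data, for illustration only: three haplotypes share one value.
data = [{"DYS393": 13}, {"DYS393": 13}, {"DYS393": 13}, {"DYS393": 12}]
own, modal = compare_to_modal(data[3], data, "DYS393")
print(own, modal)
```

A haplotype carrying the modal value gets two equal frequencies; the further its allele's frequency falls below the modal frequency, the less common that marker value is.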

The calculations are not easily described in text, but examples of marker scoring and of “common”, “average” and “uncommon” haplotypes are presented in Appendix E.


What are we measuring?

[Diagram: measurement metaphor]

We are looking at only one aspect of a haplotype: its distance from the center of its haplogroup’s universe of haplotypes.

Assume a point in space represents the point at which haplotypes diverge from each other. Imagine the universe as a circle, sphere or n-dimensional hypersphere. In the two-dimensional diagram on the right, haplotypes A, B, & C differ in other aspects, but we measure only their distances from the center.

This concept is consistent in all the measurement systems described here.
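The metaphor itself can be put into a few lines of Python. This sketch illustrates the geometry only and is not one of the rarity metrics; the coordinates for A, B and C are invented.

```python
import math

def distance_from_center(point):
    """Euclidean distance of a point from the origin, in any number of dimensions."""
    return math.hypot(*point)

# Made-up coordinates: A and B lie in different directions from the center,
# C lies closer in. Only the distance is measured, not the direction.
points = {"A": (3.0, 4.0), "B": (-5.0, 0.0), "C": (0.0, 2.0)}
distances = {name: distance_from_center(p) for name, p in points.items()}
print(distances)
```

A and B come out equally far from the center despite pointing in different directions, which is the sense in which haplotypes may differ in other aspects while sharing the same measurement.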