Sections:
Appendices
  1. Data set
  2. Wheaton data
    1. Whtn Avg: Mkr data
  3. Ratio Index data
  4. Deviation Index data
  5. Calculation examples

Tables
  1. Sample size
  2. Wheaton statistics
  3. Wheaton Interpretation
  4. WApM stats
  5. WApM Interp
  6. Ratio Index stats
  7. Ratio Interp
  8. Deviation Index stats
  9. Dev'n Interp
  10. Correlations
  11. Cross-comparability
  12. Cross-rankings
  13. Matches vs. rarity
  14. No Matches vs. rarity

Figures
  1. YCAIIb
  2. DYS449
  3. DYS390
  4. Wheaton metric
  5. Theory vs Actual
  6. WApM index
  7. Ratio Index scores
  8. Deviation Index
  9. ?
  10. Wheaton
  11. Ratio
  12. Deviation
  13. Scatter plot
  14. Scatter plot
  15. Scatter plot
  16. Scatter plot: Whtn vs. Matches
  17. Scatter plot: Ratio vs. Matches
  18. Scatter plot: Devn vs. Matches
  19. Avg. Matches vs. rarity
  20. Max Matches vs. rarity
  21. No Matches vs. rarity


Measuring Haplotype Rarity

This page and those linked from it address the subject of Y-DNA STR haplotype[1] rarity and the other side of the coin, haplotype commonness. They assume a basic knowledge of Y-DNA.

The discussion draws on a study of eight large DNA projects for high-frequency, multi-origin surnames conducted in Spring 2015 by Ralph Taylor, administrator of the Taylor Family Genes project. [a] Some data (that available only to a project administrator) comes solely from this one project.

Summary

A Y-STR haplotype[1] can be measured in terms of its commonness or rarity, two sides of the same coin. The more common a haplotype is, the more similar it is to other common haplotypes. The rarer a haplotype is, the more distinct it is from others.

We define this aspect in terms of two characteristics:

  1. The commonness/rarity of the marker/allele values comprising the haplotype; and
  2. The commonness/rarity of some markers.
    • Certain markers -- such as DYS19b, DYS385c and DYS464efgh -- are uncommon or rare in their very existence.

Measurement is based on the frequency in the population of the marker/allele values in the haplotype. We compare the percentage of men who do not have the particular value for that marker to the percentage who do not have the most common value. (It's best to limit scoring to within a haplogroup.)

Quantifying this aspect, the measurements fall along a spectrum which can be divided into broad categories:

  1. Very Common, the most common (lowest-scoring) 5%;
  2. Common, the next lowest-scoring 20%;
  3. Average, the middle 50%;
  4. Uncommon, the next highest 20%;
  5. Rare, the highest-scoring 5%.
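
The five bands can be read as percentile cut-offs on a sample of scores. The following is a minimal Python sketch, assuming the cut-offs sit at the 5th, 25th, 75th and 95th percentiles of the sample. The function name `categorize` and the way ties are handled (counting every sample score at or below the subject's score) are illustrative choices, not the study's spreadsheet logic.

```python
from bisect import bisect_right

CATEGORIES = ["Very Common", "Common", "Average", "Uncommon", "Rare"]
# Cumulative upper bounds of the bands: 5%, 25%, 75%, 95%, 100%.
UPPER_BOUNDS = [0.05, 0.25, 0.75, 0.95, 1.00]

def categorize(score, sample_scores):
    """Place `score` in a band by the fraction of sample scores at or below it."""
    ordered = sorted(sample_scores)
    fraction = bisect_right(ordered, score) / len(ordered)
    for name, bound in zip(CATEGORIES, UPPER_BOUNDS):
        if fraction <= bound:
            return name
    return CATEGORIES[-1]
```

With many tied scores (for example, a large block of zero scores at 12 markers), the lowest band can hold more than 5% of the sample, so the bands are approximate.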

Three metrics (plus a variation) are discussed and designated "Wheaton", "Ratio", "Deviation" and "Wheaton Average per marker". For example, the Wheaton method would score an R1b value of 14 on DYS393 as follows:

Then (depending on the type of measurement) we sum or average the comparisons across all markers tested to get a total score for the haplotype.

Scoring more markers results in higher scores but can be accounted for. One way is to divide the total score by the number of markers scored to get an average per marker. This average is very roughly comparable across different marker sets. 
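
As a small illustration of the per-marker average, the sketch below divides a total score by a marker count. The function name is hypothetical, and the example divides by the nominal marker-set size, which is an assumption; the number of markers actually scored can differ from the nominal size.

```python
def average_per_marker(total_score, markers_scored):
    """Divide a haplotype's total score by the number of markers scored."""
    if markers_scored <= 0:
        raise ValueError("markers_scored must be positive")
    return total_score / markers_scored

# Median Wheaton scores from Table 2, divided by nominal marker-set size.
median_12 = average_per_marker(129, 12)
median_25 = average_per_marker(283, 25)
```

These come to 10.75 and 11.32, the medians that Table 4 reports for 12 and 25 markers.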

With a sufficiently large and representative sample of scores, we can begin to see where a particular haplotype falls on the scale. Here's what we got for a Wheaton average per marker:

Wheaton Average per Marker

Category      12 markers   25 markers   37 markers   67 markers
Very Common   0-0          0-4          0-6          0-6
Common        0.1-6.0      4-8          6-8          6-8
Average       6-16         8-16         8-14         8-12
Uncommon      16-24        16-22        14-18        12-20
Rare          >24          >22          >18          >20
 
You can download a tool (an Excel spreadsheet) to calculate your own rarity scores by clicking on this link. {Note: This feature is supported only by Chrome 14.0+, Firefox 20.0+ & Opera 15.0+. It is not supported by Internet Explorer or Safari.}

Introduction

We explore and review quantification of Y-DNA STR haplotype [1] commonness and rarity. We will discuss issues related to measuring the phenomenon and four measurement methods.

In a sense, almost every Y-haplotype is unique if sufficiently tested. However, genetic genealogy relies on similarities between haplotypes. It could be helpful to the genetic genealogist to know the extent to which similarities may be causal (shared patrilines[2]) or coincidental.

Some recent developments in genetic genealogy have led to a need for quantitative measures of Y-STR haplotype commonness and the other side of the same coin, rarity.

Measuring a phenomenon objectively and quantitatively is a first step in understanding it. This article attempts that step.

Implications

An obvious implication of common haplotypes is that, in order to properly identify patrilines, those with them need higher resolution[7] levels and perhaps more SNP testing than those with uncommon or rare haplotypes. Conversely, those with rare haplotypes may achieve the same goal with less testing.

Other implications affect grouping into paternal genetic families. Members of the same genetic family should have similar (though not necessarily identical) scores on any commonness/rarity measure.

And, apparent matches between those with common haplotypes may be coincidental rather than reflect shared patrilines[8].

This discussion is expanded here.

Measurement Problems

Most men’s haplotypes consist primarily of modal values for their haplogroup. Only a few of their STR markers will display unusual values. Measurement of commonness vs. rarity depends on highlighting the less-common values.

See this page for more discussion.

Background:

The author first became interested in this subject as a newly-minted DNA project administrator. Project participants were asking “Is something unusual in my DNA?” Guidance for answering was sparse but the questions persisted.

He also noticed that a sizable fraction of men had hundreds of “close matches” at high resolution levels and another fraction had none. Over time, he came to view “rare” haplotypes and “common” haplotypes not as separate phenomena, but possibly as points on a spectrum.

Robert B. Casey and Kelly Wheaton had made important contributions. Casey had the intellectual insight, and Wheaton published a fuller discussion of the concept online.

For more, see this page.

Data Set and Collection

For this article, concepts were developed and initially tested with data solely from the Taylor Family Genes project. Further analysis then included data from eight large DNA projects for high-frequency surnames. The data set comprised published[23] Y-STR values for eight surname DNA projects: Smith, Johnson/Johnston, Williams, Brown, Jones, Miller, Anderson, and Taylor[24]. See Appendix A for more information.

The eight projects were chosen because each covers a high-frequency, English-language surname of multiple origins and each is a large project.

Of the 7,468 participant haplotypes collected across all haplogroups, those in the R1b haplogroup were analyzed further by number of markers tested. This R1b data set consists of:

Table 1
R1b Sample Sizes

       12 markers tested   25 markers tested   37 markers tested   67 markers tested
n =    4,940               4,203               3,824               1,953
n%     100%                85.1%               77.4%               39.5%

The decreasing numbers of participants as markers increase reflect actual levels of testing; fewer than 40% of participants in these projects tested to 67 markers. As will be seen below, their haplotypes presented a wide range of commonness vs. rarity.

Data was collected in late May and early June 2015 by copying and pasting Y-results tables into Excel spreadsheets. Unique identification numbers were then assigned. Names, kit numbers, paternal ancestor and country of origin information were removed to protect individual privacy before the STR data was pasted into another Excel spreadsheet combining all projects for calculations.

Marker Set, Nominal Size:

See also the expanded section.

The terms “marker set” and “nominal size” are used for convenience; we shall define them. Four marker sets are discussed; their nominal sizes and make-up consist of

Due to possibilities of extra marker copies and some variability in markers tested, nominal sizes of marker sets are not necessarily the number of markers tested and scored.

“Extra copies”:

Additional copies (beyond the usual) of DYS19, DYS385 & DYS464 are sufficiently abnormal that their existence -- more so than their allele values -- is indicative of haplotype rarity. Due to reporting conventions, the values can never be less than that of the highest usual copy. Therefore, we limit determination of extra copies’ rarity to simple existence, rather than calculate based on allele frequency distributions.

Calculations:

Calculations were carried out in computer spreadsheets (MS Excel©), using index formulas. For each marker in each haplotype, these formulas looked up the frequency of the allele value and compared it to the frequency of the most common value for the marker. (Specifics of the comparison differ with the type of metric investigated.)

The calculations are not easily described in text, but examples of marker scoring and of “common”, “average” and “uncommon” haplotypes are presented in Appendix E.
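
The lookup step can be sketched in Python rather than spreadsheet formulas. This is only an illustration under assumptions: the function names are hypothetical, haplotypes are represented as marker-to-allele dictionaries, and the four-haplotype sample is invented rather than taken from the study.

```python
from collections import Counter

def allele_frequencies(haplotypes):
    """Return {marker: {allele: share of haplotypes carrying that allele}}."""
    counts = {}
    for haplotype in haplotypes:
        for marker, allele in haplotype.items():
            counts.setdefault(marker, Counter())[allele] += 1
    frequencies = {}
    for marker, counter in counts.items():
        tested = sum(counter.values())  # haplotypes with a result for this marker
        frequencies[marker] = {a: n / tested for a, n in counter.items()}
    return frequencies

def modal_allele(frequencies, marker):
    """Return the most common allele value recorded for `marker`."""
    table = frequencies[marker]
    return max(table, key=table.get)

# Invented toy sample, not data from the study.
sample = [
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 13, "DYS390": 23},
    {"DYS393": 14, "DYS390": 24},
]
freqs = allele_frequencies(sample)
```

Each metric then compares the frequency of a haplotype's own value with the frequency of the modal value in its own way.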

What are we measuring?

Figure: Measurement metaphor

We are looking at only one aspect of a haplotype: its distance from the center of its haplogroup’s universe of haplotypes.

Assume a point in space represents the point of divergence of haplotypes from each other. Imagine the universe space as a circle, sphere or n-dimensional hyper-sphere. In the two-dimensional diagram on the right, haplotypes A, B, & C may differ in other respects, but we measure only the distances from the center.

This concept is consistent in all the measurement systems described here.

Measurement Tool

Download a tool to measure a single haplotype here.

Wheaton Method

See expanded page.

In Kelly Wheaton’s publication of Casey’s method[29], the percentage of men who do not hold the modal value is subtracted from the percentage who do not hold the individual’s particular value, so a modal value scores zero.

Calculate this metric by

  1. Determining, for each marker, the percentage of men who do not have the particular value.
  2. Determining, for each marker, the percentage of men who do not have the modal value.
  3. Subtracting the step 2 percentage from the step 1 percentage to get a “net” score.[26]
  4. Summing scores for the set of tested markers.
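
The four steps above can be sketched as follows. This is a minimal illustration, not the study's spreadsheet: the function names are hypothetical, the frequency table is invented, and treating an allele missing from the table as frequency 0 is an assumption.

```python
def wheaton_marker_score(freq_of_value, freq_of_modal):
    """Net score for one marker, in percentage points."""
    pct_without_value = 100.0 * (1.0 - freq_of_value)  # step 1
    pct_without_modal = 100.0 * (1.0 - freq_of_modal)  # step 2
    return pct_without_value - pct_without_modal       # step 3

def wheaton_score(haplotype, frequencies):
    """Sum the per-marker net scores over every marker in the haplotype (step 4)."""
    total = 0.0
    for marker, allele in haplotype.items():
        table = frequencies[marker]
        # An allele absent from the table is treated as frequency 0 (an assumption).
        total += wheaton_marker_score(table.get(allele, 0.0), max(table.values()))
    return total

# Invented frequencies for illustration only.
example_frequencies = {
    "DYS393": {13: 0.90, 14: 0.06, 12: 0.04},
    "DYS390": {24: 0.60, 23: 0.25, 25: 0.15},
}
all_modal = {"DYS393": 13, "DYS390": 24}
one_off = {"DYS393": 14, "DYS390": 24}
```

With these invented frequencies, the all-modal haplotype scores 0 and the haplotype with one non-modal value scores about 84 (94 minus 10 on DYS393, plus 0 on DYS390).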

Figure 4 shows the distributions of the resulting scores for four levels of resolution. See Appendix B for details.

Distributions of Wheaton Scores
Figure 4

Table 2
Wheaton Score Summary Statistics

Statistic            12 markers   25 markers   37 markers   67 markers
Average (Mean)       142          292          430          651
Standard Deviation   89.6         128.2        144.9        192.0
Minimum              0            0            39           220
Maximum              946          1,186        1,521        2,228
Median               129          283          415          636
Mode [26]            #N/A         #N/A         #N/A         #N/A
n=                   4,940        4,203        3,824        1,953

Advantages of Wheaton’s method include

However, there are disadvantages:

Wheaton partially overcame the first disadvantage following an e-mail message from Casey[28] (for R1b, 67 markers) with this categorization:

As discussed in the “Evaluation” section below, we found these interpretations to be in error.

Evaluation:

Lacking Casey’s or Wheaton’s data, we presume they found this system of categories appropriate for their projects. However, assessment using all R1b STR results in the data set showed a somewhat different picture more generally.

Theoretical vs. Actual Distribution
Figure 5

We calculated “Wheaton scores”[29] for all 1,953 67-marker STR results of the eight projects in the R1b haplogroup[30], using the methods outlined above. We then compared the theoretical distribution with the actual 67-marker score distribution, as shown in Figure 5.

The actual distribution of scores did not match the Casey/Wheaton model. What was proposed as “uncommon” (500-700) for 67 markers included the median score (637) in our data set.

The most important statistics in Table 2 above are the median scores; one-half of scores are higher and one-half lower. Thus, the median denotes the mid-point on any scale. An “uncommon” haplotype should score much higher than the median. For a fuller discussion, see Scale-setting.

We propose the five-category interpretation in Table 3, based on the observed data.

Table 3
Revised Wheaton Score Interpretation

Category      Pct.   12 markers   25 markers   37 markers   67 markers
Very Common   ~5%    0-0          0-99         0-249        0-349
Common        ~20%   1-74         100-199      250-349      350-499
Average       ~50%   75-175       200-349      350-424      500-749
Uncommon      ~20%   176-249      350-499      425-649      750-899
Rare          ~5%    ≥250         ≥500         ≥650         ≥900

Wheaton Average per Marker (WApM)

Distributions of Wh. Avg./Mkr
Figure 6

The problem of comparability across marker sets led us to consider an “average per marker” metric, derived by dividing the haplotype Wheaton score by the minimum of

See Appendix B2 for details of the distributions. The effect is to bring the distribution curves, as displayed in Figure 6, into similar ranges on the X axis. Note that peaks occur at similar X values (8-12) for each marker set.

Summary statistics are:

Table 4
Wheaton Average per Marker

Statistic            WApM 12   WApM 25   WApM 37   WApM 67
Average (mean)       11.85     11.69     11.62     9.78
Standard Deviation   7.46      5.27      3.93      2.89
Minimum              0.00      0.00      1.05      3.29
Maximum              78.83     81.69     42.98     34.45
Median               10.75     11.32     11.24     9.54
Mode                 0.00      11.80     13.16     9.37
n=                   4,940     4,203     3,824     1,953

Again, the median defines the mid-point of the scale. We can roughly interpret thus:

Table 5
WApM Score Interpretation

Category      WApM 12   WApM 25   WApM 37   WApM 67
Very Common   0-0       0-4       0-6       0-6
Common        0.1-6.0   4-8       6-8       6-8
Average       6-16      8-16      8-14      8-12
Uncommon      16-24     16-22     14-18     12-20
Rare          ≥24       ≥22       ≥18       ≥20

Despite the additional calculation step, this is an easier measurement to interpret. Differences between marker sets are smaller, and they relate to differences in the frequency distributions of the markers making up the sets. [32]

Taylor Ratio Index

See expanded page.

We considered a modification of Casey’s concept to attain greater comparability between marker sets and, hopefully, between haplogroups: a ratio of an individual’s score to the modal score. The most common possible haplotype (e.g., WAMH) will score exactly one (1) and less common haplotypes will have higher scores. However, a haplotype score of 5.0 is not five times as rare as one scoring 1.0.[43]

This metric is calculated by

  1. Determining, for each marker, the percentage of men who do not have the particular value.
  2. Determining, for each marker, the percentage of men who do not have the modal value.
  3. Dividing the step 1 percentage by the step 2 percentage to obtain a ratio.
  4. Averaging the ratios across the markers in the marker set.
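
A hedged sketch of the ratio calculation follows. It assumes the denominator is the share of men who do not have the modal value, since that is what makes an all-modal haplotype score exactly 1, and it averages the ratios over the markers scored. The function name and the frequency table are invented for illustration.

```python
def ratio_index(haplotype, frequencies):
    """Average, over markers, of (share without this value) / (share without the modal value)."""
    ratios = []
    for marker, allele in haplotype.items():
        table = frequencies[marker]
        # An allele absent from the table is treated as frequency 0 (an assumption).
        share_without_value = 1.0 - table.get(allele, 0.0)
        share_without_modal = 1.0 - max(table.values())
        if share_without_modal == 0.0:
            raise ValueError(f"{marker}: every man has the modal value; ratio undefined")
        ratios.append(share_without_value / share_without_modal)
    return sum(ratios) / len(ratios)

# Invented frequencies for illustration only.
example_frequencies = {
    "DYS393": {13: 0.90, 14: 0.06, 12: 0.04},
    "DYS390": {24: 0.60, 23: 0.25, 25: 0.15},
}
all_modal = {"DYS393": 13, "DYS390": 24}
one_off = {"DYS393": 14, "DYS390": 24}
```

Here the all-modal haplotype scores 1.0, while the one-off haplotype averages a ratio of about 9.4 on DYS393 with 1.0 on DYS390, giving roughly 5.2. A single rare value on a marker with a dominant modal value can therefore move the index a long way.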


Figure 7: Taylor Ratio Index Scores

Figure 7 displays the distributions for the ratio indices. See Appendix C for distribution details.

Distributions for 25, 37 & 67 markers are highly similar; their frequency peaks all occur at the same index value, which was the intent of this system.

However, also note steep slopes on left tails and peaks at low indices (<1.35); these aspects suggest the ratio index is imprecise in distinguishing very common from common haplotypes, especially at lower resolutions (<37 markers).

Summary statistics of the Ratio Index:

Table 6
Ratio Index Summary Statistics

Statistic            12 mkr.   25 mkr.   37 mkr.   67 mkr.
Average              1.62      1.65      1.60      1.62
Standard Deviation   1.59      0.94      0.68      0.46
Minimum              1.00      1.00      1.00      1.00
Maximum              78.3      40.0      27.9      9.9
Median               1.41      1.43      1.44      1.49
Mode                 #N/A      #N/A      #N/A      #N/A
n=                   4,940     4,203     3,824     1,953

Interpretation of the Ratio Index:

Table 7
Ratio Index Interpretation

Category      12 markers   25 markers   37 markers   67 markers
Very common   =1.000       1.0-1.1      1.0-1.2      1.0-1.2
Common        >1-1.1       1.1-1.3      1.2-1.3      1.2-1.3
Average       1.1-1.6      1.3-1.6      1.3-1.8      1.3-1.8
Uncommon      1.6-1.8      1.6-2.6      1.8-2.4      1.8-2.4
Rare          ≥1.8         ≥2.6         ≥2.4         ≥2.4

Advantages of this “Taylor ratio index” are

Disadvantages:

Comment:
Attempting to more clearly differentiate scores, we tried a modification – squaring the ratios and then taking the square roots of the sums. It proved not worth the complication; only minor differentiation was seen.

Deviation Index

See expanded page.

We also considered an index based on standard deviations[33] from the mode. This metric was developed specifically to highlight uncommon and rare haplotypes.

It is calculated by:

  1. Determining a standard deviation for the frequency distribution of each marker; this will be less than 1, ranging from near zero to ~0.17. (To avoid division-by-zero errors, take the maximum of the standard deviation and 0.001, i.e. 10^-3.)
  2. Determining the absolute difference between the marker modal value and the subject value, resulting in numbers ≥0.
  3. Dividing the difference (step 2) by the standard deviation (step 1), again ≥0.
  4. Summing across all markers in the marker set.
  5. Dividing sums by nominal marker set size.
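
The arithmetic of steps 3 to 5, including the 0.001 floor from step 1, can be sketched as below. The sketch deliberately takes the per-marker absolute difference and standard deviation as inputs, because how those two quantities are derived is not spelled out here. The function name and the example numbers are hypothetical.

```python
STD_FLOOR = 0.001  # floor from step 1, to avoid division by zero

def deviation_index(marker_inputs, nominal_size):
    """marker_inputs: (absolute difference, standard deviation) pairs, one per marker scored."""
    total = 0.0
    for abs_difference, std_dev in marker_inputs:
        total += abs_difference / max(std_dev, STD_FLOOR)  # step 3
    return total / nominal_size                            # steps 4 and 5

# Invented inputs: two modal markers and one marker that differs by 1.
example = [(0, 0.17), (1, 0.05), (0, 0.0)]
score = deviation_index(example, nominal_size=3)
```

In this invented case the single difference contributes 1 / 0.05 = 20, so the index is about 6.67. That shows how one non-modal value on a low-variance marker dominates the score.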


Figure 8: Deviation Index

Figure 8 displays the distribution graphs. See Appendix D for details. Note the additional peaks, at >=16, for the 37- and 67-marker sets; these were due to this method’s tendency to emphasize less-common values. 19% of 12-marker scores, 2% of 25-marker scores, 27% of 37-marker scores and 28% of 67-marker scores are greater than 16.

Resulting scores, however, do not directly relate to probabilities. They are simply indicators of how common or rare the haplotype is. Lower scores are more common; higher scores more uncommon. If scores seem abnormally high, it is because each allele difference from a modal value represents many standard deviations.

Summary statistics of the Deviation Index:

Table 8
Deviation Index Summary Statistics

Statistic            12 mkr.   25 mkr.   37 mkr.   67 mkr.
Average (mean)       10.31     7.27      16.57     9.52
Standard Deviation   9.41      6.48      19.26     7.53
Minimum              0.00      0.00      1.00      1.05
Maximum              371.50    292.56    372.09    50.93
Median               9.19      6.50      9.62      6.07
Mode                 #N/A      #N/A      #N/A      #N/A
n=                   4,940     4,203     3,824     1,953

Comment: The deviation method highlights uncommon and rare haplotypes. It performs poorly at distinguishing common from average-rarity haplotypes.

Interpretation of the Deviation Index:

Table 9
Deviation Index Interpretation

Category      12 markers   25 markers   37 markers   67 markers
Very common   =0           0-2          0-4          0-2
Common        >0, <4       2-4          4-6          2-4
Average       4-14         4-10         6-16         4-10
Uncommon      14-22        10-20        16-40        10-22
Rare          >22          >20          >40          >22

Advantages:

Disadvantages:

Disadvantages appear to outweigh the advantages.

Comparison

See also the expanded page.

Comparing the three measurement methods we see these distributions:


Figure 10

Figure 10a: WApM

Figure 11

Figure 12

They appear very different in Figures 10-12. Are they as different as they look? Do the systems consistently -- from one to another -- identify haplotype positions on the spectrum of commonness/rarity? Scatter plots give a partial answer.

Scatter Plots for 67 Markers

Figure 13

Figure 14

Figure 15

Scatter Plots for 37 Markers

Figure 16

Figure 17

Figure 18

Between Wheaton & Ratio scores, the plots appear to show a very rough positive correlation. Between Wheaton & Deviation, the plots appear to show two or three bands, each positively but slightly correlated. Between Ratio & Deviation, the plots show bands inverted to the former.

Correlation

We also attempted to assess comparability of the three systems using Pearson’s correlation. These results were obtained:

Table 10
Correlation of System Scores

            12 mkrs             25 mkrs             37 mkrs             67 mkrs
n=          4,940               4,203               3,824               1,953
            Wheaton   Ratio     Wheaton   Ratio     Wheaton   Ratio     Wheaton   Ratio
Wheaton     *         0.35      *         0.42      *         0.44      *         0.58
Ratio       0.35      *         0.42      *         0.44      *         0.58      *
Deviation   0.52      0.36      0.52      0.30      0.25      0.080     0.25      0.031

Significance [37]
Wheaton     *         <0.0001   *         <0.0001   *         <0.0001   *         <0.0001
Ratio       <0.0001   *         <0.0001   *         <0.0001   *         <0.0001   *
Deviation   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001   0.0854
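
The text does not say how the significance values were computed. As a hedged sketch, one common approach converts a correlation coefficient r and sample size n into a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), and reads off a tail probability. The code below assumes a one-tailed test and, because n is large, substitutes the normal distribution for Student's t. Both choices are assumptions made for illustration, and the function name is hypothetical.

```python
import math

def one_tailed_p(r, n):
    """Approximate one-tailed p-value for a Pearson r from n pairs (large-n normal approximation)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # Upper-tail probability of a standard normal at t.
    return 0.5 * math.erfc(t / math.sqrt(2.0))

p_67_dev_ratio = one_tailed_p(0.031, 1953)
p_37_dev_ratio = one_tailed_p(0.080, 3824)
```

Under these assumptions, r = 0.031 with n = 1,953 gives roughly 0.085, in line with the 0.0854 shown for Deviation vs. Ratio at 67 markers, and r = 0.080 with n = 3,824 falls well below 0.0001.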

Meaning:

Rank Correlation

With Taylor data only[38], we also ranked haplotypes by scores in each system in order of lowest scores first; tied scores were assigned the same rank and subsequent ranks adjusted to account for the ties.
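
One reading of this ranking rule is "standard competition" ranking (1, 2, 2, 4), sketched below with a hypothetical function name. That reading is an assumption; averaging the ranks of tied scores (1, 2.5, 2.5, 4) is another common convention, and the text does not say which was used.

```python
def competition_ranks(scores):
    """Rank scores lowest-first; ties share the lowest rank and later ranks skip ahead."""
    ordered = sorted(scores)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (lowest) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]

ranks = competition_ranks([30, 10, 20, 10])
```

The two haplotypes scoring 10 both receive rank 1, the score of 20 receives rank 3, and the score of 30 receives rank 4. Correlating such ranks instead of raw scores reduces the influence of extreme scores.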

This produced larger correlation coefficients than with the above correlation of actual scores. We concluded:

Cross-comparability across marker sets

Again with Taylor data only, we investigated whether the measurement systems were consistent across marker sets. Did haplotypes score the same at different numbers of markers? It was, of course, possible to look at only the 50% who had results for all marker sets.

At first glance, scores seemed inconsistent across the marker sets. However, correlation coefficients were mostly high, ranging from a low of 0.179 to a high of 0.998. The least significance observed was p=0.0123[39] (Wheaton: 25 vs. 67).

Table 11
Comparability Across Marker Sets

n= 196       Wheaton             Ratio               Deviation
             12 mkrs   37 mkrs   12 mkrs   37 mkrs   12 mkrs   37 mkrs
12 markers   *         0.665     *         0.680     *         0.755
25 markers   0.731     0.931     0.703     0.984     0.998     0.695
37 markers   0.665     *         0.680     *         0.755     *
67 markers   0.233     0.265     0.285     0.341     0.695     0.923

Significance
12 markers   *         =0.001    *         <0.0001   *         <0.0001
25 markers   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001
37 markers   <0.0001   *         <0.0001   *         <0.0001   <0.0001
67 markers   0.001     =0.0002   <0.0001   <0.0001   <0.0001   *

Rank Correlation

Haplotypes were ranked by scores for each marker set in each of the three methods and then the rankings were compared. Correlation coefficients ranged from 0.181 to 0.9384, and the least significance was p=0.0113.

Table 12
Cross-comparability of Rankings

n= 196       Wheaton             Ratio               Deviation
             12 mkrs   37 mkrs   12 mkrs   37 mkrs   12 mkrs   37 mkrs
12 markers   *         0.588     *         0.662     *         0.211
25 markers   0.685     0.890     0.759     0.938     0.776     0.210
37 markers   0.588     *         0.662     *         0.181     0.249
67 markers   0.214     0.316     0.285     0.324     0.211     *

Significance
12 markers   *         <0.0001   *         <0.0001   *         =0.011
25 markers   <0.0001   <0.0001   <0.0001   <0.0001   <0.0001   =0.001
37 markers   <0.0001   *         <0.0001   *         =0.011    *
67 markers   =0.0027   <0.0001   <0.0001   <0.0001   =0.003    =0.0004

Comment:

While the correlations are significant, at least some is due to inclusion of smaller marker sets within larger; the 67-marker set includes the 37-, as 37- includes 25-, etc. It is specifically not suggested that scores for markers 1-12 are related to scores for markers 13-25, 26-37 or 38-67.

Do the measurements correlate with number of matches?

We would expect low-scoring haplotypes to have more matches in the FTDNA database than high-scoring ones. Is the expectation borne out?

We were able to analyze this question only for Taylor project participants; we did not have access to the needed data for the seven other projects.[40] We analyzed match data for the 325 Taylor R1b project participants who’ve tested 37 or more markers. We are wary of offering this analysis because we suspect the sample is biased by participants’ reasons for joining the project.

For each participant, the following items were recorded:

  1. Number of intra-project (with other project participants) matches;
  2. Number of extra-project (with non-participants) matches;
  3. Whether the participant had no matches found in the FTDNA database.

The sum of items 1 and 2 above represents the total number of matches in the FTDNA database.

The data is summarized in Table 13.

Table 13: Taylor Project
Number of Matches per Participant at 37 Markers

                         Average                       Maximum
Category         Count   Intra-    Extra-    Total     Intra-    Extra-    Total
                         project   project             project   project
Very Common      24      1.88      29.00     30.88     7         106       108
Common           80      2.14      28.10     30.24     14        284       284
Average          124     2.30      20.54     22.85     16        390       398
Uncommon         76      1.35      14.52     15.87     8         211       212
Rare             21      1.19      10.71     11.90     4         128       128
All categories   325     1.94      21.05     22.91     16        390       398

Distribution Graphs


Figure 19: Average Matches vs. Rarity

Figure 20: Maximum Matches vs. Rarity

For better visualization in Figures 19 and 20, intra-project matches are graphed on the left vertical axis; extra-project and total matches on the right vertical axis.

In Figure 19 the curves differ for average number of intra-project and extra-project matches per participant. Extra-project and total matches decline consistently as rarity increases. Intra-project matches increase until haplotypes reach average rarity, then decline. In Figure 20, the three curves follow similar patterns.

Scatter plot diagrams for 37- & 67-marker scores[41] also illustrate:

Figure: Wheaton score vs. number of matches

Figure: Ratio Index vs. number of matches

Figure: Deviation Index vs. number of matches

We expected to see a pattern in which matches increased as scores decreased, i.e., a downward slope from left to right. Such a pattern is not evident in the plots; we see, instead, linear trend lines with positive slopes. The data appear to refute the hypothesis, though not to statistical significance.

However, those with high numbers of matches (>100) do fit under certain score limits: 800 for Wheaton, 2.0 for Ratio and 100 for Deviation.

No Matches

More illuminating, perhaps, may be the percentages of men whose haplotypes have a complete absence of matches reported in the FTDNA database[44], as shown in Table 14 and its accompanying graphic Figure 21.

Table 14:
No Matches

Category         Count   W/ No Match   Percent
1. Very Common   24      2             8.3%
2. Common        80      7             8.8%
3. Average       124     8             6.5%
4. Uncommon      76      12            15.8%
5. Rare          21      4             19.0%
All Categories   325     33            10.2%

Figure 21: Participants with no matches

The data show a correlation coefficient = 0.828 between rarity category (given a numerical rating) and absence of matches. This correlation accounts for 68.5% of variation in the data.
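
This figure can be approximated from Table 14 alone. The sketch below assumes the numerical ratings are simply 1 through 5, in the order the categories are listed, and that the correlation was taken over the five category percentages. Both are assumptions, and the variable and function names are illustrative.

```python
import math

ratings = [1, 2, 3, 4, 5]              # assumed numeric rating per category
counts = [24, 80, 124, 76, 21]         # participants per category (Table 14)
no_match = [2, 7, 8, 12, 4]            # participants with no matches (Table 14)
percents = [100.0 * m / c for m, c in zip(no_match, counts)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson(ratings, percents)
r_squared = r * r  # share of variation accounted for
```

Under these assumptions r comes out near 0.83 and r squared near 0.685. With only five points, one per category, the coefficient should be read cautiously.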

Summary of measurement correlation with number of matches:

We’ve looked at scatter plots, intra-project and extra-project matches, as well as no matches, but were limited to just those for whom we had match data. These data provide tantalizing hints, but do not satisfactorily answer the question posed.

The hints include:

A clearer signal might be gained with a larger and broader sample. We suspect inherent noise in the data obscures detection.

Conclusions

Measuring the commonness or rarity of Y-STR haplotypes is possible but can be difficult and has no ready answers. There may be no perfect measurement tool.

Some measurement tool, though, is helpful to understand the situations with which a project is dealing. Where do haplotypes fit on a spectrum of common to rare? How likely is it that reported close matches may be due to coincidence? For example,

We posit that multi-origin surnames (those with many founders) display a wide variety of haplotypes, ranging from very common to exceedingly rare. This hypothesis is borne out by the expanded data set but needs further testing. We invite other projects with wide diversity in Y-DNA (such as for common and multi-origin surnames) to conduct their own assessments.

A five-point scale (very common, common, average, uncommon, rare) appears a reasonable way to interpret the scores resulting from any such measurement into an easily-digested, common-sense meaning. However, the scale’s categories must be based on a sufficiently diverse sample to have general application.

The scale for measuring commonness vs. rarity must be adjusted to the particular marker sets. Scores on any of these systems will be dependent on the markers used and will be higher with more markers than fewer.

The specific method proposed by Kelly Wheaton (as conceptualized by Robert B. Casey) is the mathematically simplest of the methods discussed here and appears appropriate to haplogroup R1b -- provided that interpretations of the scores into the five-point scale are modified as in Table 3 above. Though involving another calculation step, we believe that the “Wheaton Average per Marker” method is better; it yields numbers more comparable across marker sets.

The ratio index method is simple to interpret but more mathematically challenging. The author intends to develop a Web-based tool to handle the math; in the meantime, an Excel tool is available on request.

The Deviation Index is the most difficult of the three, both to obtain and interpret. It performs poorly at discriminating between average and common haplotypes. Its sole advantage is to highlight uncommon and rare haplotypes.

For each of the systems, cross-comparability across marker sets is reasonable considering that the mix of “fast” (volatile) and slow markers varies across the sets.

Ultimately, we are unable to satisfactorily answer questions of how much haplotype commonness vs. rarity contributes to numbers of haplotype matches. Answers will require far more data, coupled with robust analysis.

 

Acknowledgements

We stand on others’ shoulders; without their work this project would not have been possible. To give credit where it’s due:

 Thank you Kelly Wheaton for publishing a discussion and demonstration on your Wheaton surname project site’s “Beginner’s Guide to Genetic Genealogy”, https://sites.google.com/site/wheatonsurname/beginners-guide-to-genetic-genealogy/lesson-14-more-with-the-y. Public accessibility was critical to proceeding to investigate.

 Thank you Robert B. Casey for your insight (at www.rcasey.net) that haplotype commonness and rarity could be measured by component markers and the rarity or commonness of values. It represents an intellectual breakthrough.

 Thank you, posthumously, Leo Little for publishing tables of marker value frequencies by haplogroup, laying the foundation for Casey.

We give thanks to the eight projects’ administrators who published their Y-STR data. Its public availability was crucial.[42]

Thank you, too, to colleagues who reviewed and commented on early drafts. The advice of James Irvine, Debbie Kennett, Dr. Maurice Gleason and Brian Swann has been instrumental in improving this study.

Any errors or misconceptions, however, are the sole responsibility of the author.