Measuring Haplotype Rarity
Footnotes & End Notes
This page collects the notes that appeared as footnotes in the original article; they are presented here as endnotes.
- No participants of any project are identified, nor are individual results
discussed except to list minima and maxima. We are here looking for general
patterns & trends, for which individual identities are not needed.
- A haplotype is a particular pattern of allele values for the Y-STR
markers tested; it consists of a list of marker names and their associated allele values.
- “Patriline” here refers to shared direct paternal ancestry or “paternal lineage”.
- A “close match” is here defined as one falling within the FTDNA reporting windows of 1:12, 2:25, 4:37, or 7:67, where the number before the colon represents the number of mismatches and the number after the colon the number of markers compared.
- Exploring these topics is beyond the scope of this article.
- A haplogroup is a phylogenetic classification, here of Y chromosomes. Many haplotypes are found in a haplogroup.
- “Genealogic time” has been defined in various ways: as beginning with surname adoption or universal surname adoption (somewhat later than initial adoption) and/or by number of generations for which records of particular persons may be found. The term may have no fixed meaning, but is clearly less than archaeological or historical time.
- "Resolution" here refers to the number of markers which can be compared against other haplotypes. The term is taken from James M. Irvine’s
article, http://www.jogg.info/62/files/Irvine.pdf.
- The terms “genetic family” and “paternal lineage” or patriline
-- though related -- do not connote the same thing. In this discussion, a genetic family consists of actual persons found in the present, through DNA testing, to share a common ancestor. A paternal lineage or “patriline” is a theoretical construct; it springs from a founder and exists through time from the distant past. A genetic family will share a patriline, but the theoretical number of patrilines is not limited to the number of genetic families found. A patriline is a direct paternal/filial lineage consisting of the common ancestor
of all present direct filial descendants and the intermediate paternal ancestors.
- An STR (short tandem repeat) marker is also known as a microsatellite or locus. It consists of repeating sequences of the four DNA components A, C, G, & T. The repetitions of the sequences are counted and reported as allele values.
- This number (494) is for all haplogroups and not solely R1b, as of 1 Mar 2015.
- The project, incidentally, does not rely on exact matches to find genetic paternal families. It uses a standard of haplotype similarity, rather than identity.
- The mode is one measure of a distribution’s central tendency, the most common value observed; it is often seen near two other central tendency measures, the mean (average) and median.
- There may be a correlation between marker mutation rates and frequency distributions of values. We have not done this analysis.
- Source: Leo Little,
- A modal value is the mode for each marker, that value which is the most common found.
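As an illustration of the modal-value definition, the following Python sketch tallies allele values marker by marker and picks the most common one. The function name `modal_values`, the dictionary layout, and the sample allele values are hypothetical, chosen only for this example. Reporting a tie as `None` is also an assumption of the sketch, not a documented rule.

```python
from collections import Counter


def modal_values(haplotypes):
    """Return {marker: modal allele value} for a list of haplotypes.

    Each haplotype is a dict mapping a marker name to its allele value.
    A marker with no unique most common value maps to None (assumption).
    """
    counts_by_marker = {}
    for haplotype in haplotypes:
        for marker, value in haplotype.items():
            counts_by_marker.setdefault(marker, Counter())[value] += 1

    modal = {}
    for marker, counts in counts_by_marker.items():
        ranked = counts.most_common(2)  # top two values with their counts
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        modal[marker] = None if tied else ranked[0][0]
    return modal


# Hypothetical allele values, for illustration only.
sample = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 23, "DYS19": 14},
    {"DYS393": 12, "DYS390": 25, "DYS19": 14},
]
result = modal_values(sample)
```

In this made-up sample, DYS390 has three different values observed once each, so no single value can be called modal.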
- There is data available to project administrators which is not available to others.
- By means of successive Internet searches; thank you, Google.
- Wheaton Surname Resources, LESSON 14: More with the Y, “Common vs. Rare Haplotypes”,
- Casey’s website, www.rcasey.net, discussed the idea and method briefly on a few pages. The site was inaccessible for a short time in 2015 but has been brought back, much revised; see
http://www.rcasey.net/casdnaf.htm and find “overlapping haplotypes”. However, Kelly Wheaton expanded on the approach in the resource cited above.
- There may also be within the haplotype rare markers, e.g., a 2nd copy of DYS19, 3rd DYS385 copy or 5th, 6th or 7th DYS464 copy.
- See note 19 above
- Caution should be exercised in using Little’s frequency distributions. Some haplogroups were not well-represented in the data available to him and sample sizes were small for some haplogroups, especially in markers #s 38-67. No data was published for markers #s 68-111. Further, frequencies of less than 1% are not well-represented.
- The Y-STR values were obtained from the projects’ Family Tree DNA, Y-results pages, https://familytreedna.com, for each project and supplemented with data from World Families Network, http://www.worldfamilies.net/.
- The prior article used only 408 R1b participants of the Taylor project. The present data set expands the sample size by an order of magnitude.
- This subtraction may have been Casey’s original intent; whether it was isn’t clear, because he does not explain it on his site and his communications with Wheaton are lost. We will, therefore, term this the “Wheaton method” for its fuller explication.
- A mode cannot be calculated because there is no unique “most common” value.
- The differences in these distributions are clear in Little’s presentation
- This categorization appears to have been based on a distribution of scores. However, its derivation is undocumented.
- We give Kelly Wheaton credit here for having published a fuller explanation than Casey.
- The 1,953 R1b results represent about 65% of all project participants who tested at least 67 markers.
- Others may choose smaller percentages for categories at the ends of the scale.
- The differences in frequency distributions may in turn be related to marker volatility (mutation rate). This subject is beyond the scope of the present study.
- A standard deviation (symbol σ, pronounced sigma) is a measure of a
distribution’s spread; tightly-grouped distributions have small standard deviations
and looser-grouped distributions have larger deviations. It is calculated by the
formula √[Σ(x-μ)^2/(n-1)]. Here, we calculated it by means of an Excel formula.
Where the formula result was zero (0), 10^-3 was substituted to avoid division by zero.
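The calculation described in this note can be sketched in Python as follows. This is a minimal illustration, not the Excel formula used in the study, which is not reproduced here. The function names and sample values are hypothetical, and μ is taken to be the mean of the observed values, which is an assumption.

```python
import math

ZERO_SUBSTITUTE = 1e-3  # 10^-3, used in place of a zero standard deviation


def sample_std_dev(values):
    """Standard deviation by the formula sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n  # assumption: mu is the mean of the sample
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


def safe_std_dev(values):
    """Standard deviation with zero replaced, so it can serve as a divisor."""
    sd = sample_std_dev(values)
    return sd if sd != 0 else ZERO_SUBSTITUTE


# Hypothetical allele values for one marker, for illustration only.
spread = sample_std_dev([13, 14, 14, 15])
no_spread = safe_std_dev([14, 14, 14])
```

A marker on which every participant has the same value has a standard deviation of zero, which is the case the substitution is meant to handle.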
- WAMH = Western Atlantic Modal Haplotype, the most common
haplotype found in western Europe and the Americas.
- The method has not been evaluated for subclades of R1b or other haplogroups.
- Leo Little’s published data rounds frequencies to the nearest percentage point; frequencies less than 0.005 (½%) but (presumably) greater than 0.001 (0.1%) are reported as zero (0) and less.
- The smaller the p value, the greater the significance; p is the probability that a correlation at least this strong would arise from random chance alone. An online calculator at http://vassarstats.net/rsig.html was used to compute significance levels.
- This part of the data set consisted of 408 12-marker results, 25-marker, 37-marker, and 196 (48%) 67-marker results
- For 196 data pairs, any correlation coefficient >0.274 yields p<0.0001.
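For readers who want to check significance levels of this kind without the online calculator, here is a rough Python sketch. It uses the Fisher z-transformation with a normal approximation and a two-tailed test; both are assumptions of the sketch. The calculator cited above may use a different method, so results close to a threshold can differ slightly. The function name is hypothetical.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-tailed p value for a Pearson correlation r over n pairs.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal deviate (an approximation that improves with larger n).
    """
    if n <= 3:
        raise ValueError("need more than three data pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-tailed tail area of the standard normal distribution beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2.0))


strong = correlation_p_value(0.30, 196)
weak = correlation_p_value(0.10, 196)
```

With 196 pairs, a coefficient of 0.30 falls well below p = 0.0001 under this approximation, while a coefficient of 0.10 is not significant at the 0.05 level.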
- This data requires access to participants’ match lists. It has taken years of collecting and updating the numbers of each member’s reported matches to attain the capability for this analysis.
- The 37-marker set was chosen because 80% of the sample had tested 37, whereas <50% had tested 67.
- Note, however, that no participants are identified or identifiable. Even in the raw data, identifying information has been disguised.
- In fact, as the data shows, a Ratio Index of 5.0 indicates a haplotype much rarer
than one scoring 1.0.
- "A complete absence of matches" means that no other
haplotypes in the FTDNA database are identified as within the FTDNA
reporting windows at any resolution, that is, for any of the marker sets listed below.
Although the windows' sizes increase with marker set size, the numbers of
results in the database decrease.
About 10% of Taylor project participants have no matches reported at any resolution.
- 12 markers: Exact match
- 25 markers: 0-2 steps of genetic distance
- 37 markers: 0-4 steps of genetic distance
- 67 markers: 0-7 steps of genetic distance
- 111 markers: 0-10, not discussed here
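The reporting windows listed above can be expressed as a small lookup table. The Python sketch below is illustrative only. It assumes genetic distance is the sum of absolute step differences between allele values, which is a simplification; FTDNA's own calculation may treat some markers differently. All names and allele values are hypothetical.

```python
# Maximum genetic distance reported at each resolution, per the list above.
REPORTING_WINDOWS = {12: 0, 25: 2, 37: 4, 67: 7, 111: 10}


def genetic_distance(a, b):
    """Sum of absolute allele differences (simplified stepwise assumption)."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def within_window(a, b):
    """True if two haplotypes fall inside the window for their resolution."""
    resolution = len(a)
    if resolution not in REPORTING_WINDOWS:
        raise ValueError("unsupported number of markers")
    return genetic_distance(a, b) <= REPORTING_WINDOWS[resolution]


# Hypothetical 25-marker haplotypes, for illustration only.
first = [13] * 25
second = list(first)
second[0], second[5] = 14, 12   # two one-step mismatches
third = list(second)
third[7] = 15                   # a further two-step mismatch
```

Here `first` and `second` differ by two steps and fall inside the 25-marker window, while `first` and `third` differ by four steps and do not.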