
Relationships of S & G

This page is about possible relationships between S (number of singletons) and G (number of groups) -- both variables being components of F (found lines). This is a further exploration, to see what relationships there might be.

Caveats

Penetration

James Irvine (focusing on Y-DNA surname projects) defines a project's penetration as the number of participants relative to the world population bearing the surname, and assigns it the designator P.

Penetration for most projects is very low, on the order of 1 to 100 participants per 100,000 bearing the surname. Low penetration may keep any relationship effects from showing up in actual project data.

Actual Data

Actual data, at the moment, consist of -- essentially -- one project. The author knows of no other projects with data for comparison. It is, of course, foolhardy to generalize from a sample of one.

Other projects are invited to contribute data for comparison.

Singletons

It is sometimes assumed (unconsciously?) that singletons are artifacts of either NPEs or insufficient sampling ("penetration"). Let me propose, as an alternate view, the "Last Mohican" perspective: some project participants may represent the last surviving male of their patrilines. If true, singletons are essential for a project to have completely surveyed its surname's Y-DNA.

Update Nov. 2016: The above hypothesis holds true only for project members with direct paternal ancestry of the surname. It does not hold for those who claim a maternal relationship to the surname.

Symbols & Notation

Review:

F = G + S

Where F = found lines (an integer), G = number of groups (not total membership of groups, also an integer), S = number of singletons (an integer).
Definitions: A group consists of two or more participants with matching Y-DNA, according to the project's match rules. A singleton is a participant with no matches found within the project.
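As a minimal sketch of these definitions, the Python below counts G, S and F from a list holding one cluster label per participant. The function name and the sample labels are hypothetical, and the sketch assumes the project's match rules have already been applied to assign those labels.

```python
from collections import Counter

def found_lines(cluster_labels):
    """Return (G, S, F) given one cluster label per participant.

    Participants sharing a label are assumed to match under the
    project's match rules (the labels here are hypothetical).
    """
    sizes = Counter(cluster_labels).values()
    g = sum(1 for size in sizes if size >= 2)   # groups: two or more matching participants
    s = sum(1 for size in sizes if size == 1)   # singletons: no match within the project
    return g, s, g + s                          # F = G + S

# Seven participants: A and C are groups, B and D are singletons.
labels = ["A", "A", "B", "C", "C", "C", "D"]
print(found_lines(labels))  # (2, 2, 4)
```

Note that G counts groups, not group members: the three participants labelled C add one to G, not three.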

New Symbols

 T. 

Limits of G & S

S & G are inversely related. For equal project size, if G (through some magical process) is to increase, S must decrease. And if (through another magical process) S is to increase, G must decrease.

To remove the "magical process", we might imagine different projects with the same target population, but different samplings.

Let’s call the project size N, for number of participants; a change in project size is ΔN; and average Group size is GS. There are three hypothetical scenarios illustrating the limits of S & G.

In the real world, N is not fixed but, mostly & hopefully, increases over time. For ΔN,

At the extremes of F/A,  (0 <=  F/A <= 1):  

At low values of N:

  • N=1, G=0, S=1
  • N=2, 0<=G<=1, 0<=S<=2
  • N=3, 0<=G<=1, 0<=S<=3
  • N=4, 0<=G<=2, 0<=S<=4
  • N=5, 0<=G<=2, 0<=S<=5
  • N=6, 0<=G<=3, 0<=S<=6
  • N=7, 0<=G<=3, 0<=S<=7
  • N=8, 0<=G<=4, 0<=S<=8
  • N=9, 0<=G<=4, 0<=S<=9
  • N=10, 0<=G<=5, 0<=S<=10

Gmax follows this law: For even numbers Gmax = N/2 and for odd numbers Gmax = (N-1)/2.
And, Smax = N.
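The limits listed above can be checked by brute force. The sketch below (Python; the function names are hypothetical) enumerates every way of splitting N participants into matching clusters and records the smallest and largest G and S that occur. It takes no account of how likely each split is, only of whether it is possible.

```python
def partitions(n, max_part=None):
    """Yield each split of n participants into clusters, as a list of cluster sizes."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield [first] + rest

def limits(n):
    """Return (Gmin, Gmax, Smin, Smax) over every possible split of n participants."""
    g_values, s_values = [], []
    for sizes in partitions(n):
        g_values.append(sum(1 for size in sizes if size >= 2))  # groups
        s_values.append(sum(1 for size in sizes if size == 1))  # singletons
    return min(g_values), max(g_values), min(s_values), max(s_values)

for n in range(1, 11):
    g_min, g_max, s_min, s_max = limits(n)
    # Integer division covers both cases: N/2 for even N, (N-1)/2 for odd N.
    assert g_max == n // 2
    assert s_max == n
    print(f"N={n}: {g_min}<=G<={g_max}, {s_min}<=S<={s_max}")
```

The two assertions restate the rules for Gmax and Smax, so the loop stops with an error if either rule fails for any N from 1 to 10.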

Rates of change between S & G

From the definition of a group as >=2:

Projects with differing target populations

For projects with different target populations, is there any reason why S-G relationships should carry over from one project to another? Should Adair Y-DNA behave the same as Baker or Cruse?

High penetration levels

Until now, we have assumed low penetration levels which fit with present experience, that is P << 1, in a range of approximately 10^-5 < P < 10^-3. Within this range, the population being sampled is statistically "infinite" relative to sample size.

As N → W (for world population) -- i.e., P → 1 -- we might expect G & S to behave differently than at present low penetration levels. This supposes participation in the thousands for rare surnames, in the hundreds of thousands or millions for common surnames.

At P=1 or P≈1, what is the relationship between N, S & G?

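At 0.5 <= P <= 0.9, what is the relationship between N, S & G?

This is a sampling-without-replacement problem in which the sample is large in relation to the population being sampled. Strictly, such sampling is governed by hypergeometric rather than binomial probabilities; the binomial form is used below as an approximation.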

Imagine a bucket of balls of different colors: The number originally in the bucket is W, the number of colors is A (number of ancestral lines), and the number of balls already picked (selected & examined) is N. What are the chances that a ball picked at random will match one or more of the balls previously picked?
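The bucket model can be explored by simulation. The sketch below (Python) draws N balls without replacement and counts the resulting groups and singletons. The function name and the line sizes are hypothetical, chosen only for illustration; real surnames' line sizes are not known here, and no results from actual projects are implied.

```python
import random
from collections import Counter

def simulate_project(line_sizes, n, seed=None):
    """Draw n of the W balls without replacement and return (G, S, F).

    line_sizes[i] is the number of living males (balls) in ancestral
    line i (colour i), so A = len(line_sizes) and W = sum(line_sizes).
    """
    rng = random.Random(seed)
    # One ball per living male; its colour is the index of his ancestral line.
    bucket = [line for line, size in enumerate(line_sizes) for _ in range(size)]
    sample = rng.sample(bucket, n)              # sampling without replacement
    counts = Counter(sample).values()
    g = sum(1 for c in counts if c >= 2)        # colours drawn two or more times
    s = sum(1 for c in counts if c == 1)        # colours drawn exactly once
    return g, s, g + s

# Hypothetical surname: A = 7 ancestral lines, W = 100 living males.
line_sizes = [50, 30, 10, 5, 3, 1, 1]
for n in (5, 20, 50, 100):
    print(n, simulate_project(line_sizes, n, seed=1))
```

At N = W (P = 1) the outcome no longer depends on chance: every line with two or more living males is a group and every one-man line is a singleton, which for these hypothetical sizes gives G = 5, S = 2, F = 7.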

With success defined as matching one of the existing groups or singletons -- thus joining a group or creating a new one -- the probability Pr(k) of exactly "k" successes (matches) in "n" trials (new participants) is:

Pr(k) = n!/[k!*(n-k)!] * (F/A)^k * (1-F/A)^(n-k)

Let k = ΔN = 1.
As n-k → 0, (n-k)! → 1
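The formula above can be evaluated directly. The sketch below (Python) is a minimal illustration that takes p = F/A as the chance of a match, as the formula does; this treats every ancestral line as equally likely to turn up, which is an assumption. The values of F and A are hypothetical.

```python
from math import comb

def binomial_pr(k, n, p):
    """Probability of exactly k successes in n trials, each with success chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical values: 30 lines found so far out of 120 ancestral lines.
F, A = 30, 120
p = F / A

# One new participant (n = 1) who matches (k = 1): the formula reduces to F/A.
print(binomial_pr(1, 1, p))  # 0.25

# The probabilities over all possible k for ten new participants sum to about 1.
print(sum(binomial_pr(k, 10, p) for k in range(11)))
```

With k = n = 1 the factorial term is 1 and the exponent n-k is 0, so only (F/A) remains.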