
# Relationships of S & G

This page is about possible relationships between S (number of singletons) and G (number of groups) -- both variables being components of F (found lines). This is a further exploration, to see what relationships there might be.

## Caveats

### Penetration

James Irvine (focusing on Y-DNA surname projects) defines a project's penetration as number of participants relative to world population and assigns the designator P to it.

Penetration for most projects is very low, on the order of 1 to 100 participants per 100,000 bearers of the surname. Low penetration may keep any relationship effects from showing up in actual project data.
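As a small arithmetic illustration of the range just quoted, the sketch below treats penetration as participants divided by the total number of bearers of the surname. The function name `penetration` and the sample figures are assumptions made for illustration; they are not project data.

```python
def penetration(participants, world_population):
    """Return P, the fraction of the target population that has been sampled."""
    if world_population <= 0:
        raise ValueError("world_population must be positive")
    if not 0 <= participants <= world_population:
        raise ValueError("participants must lie between 0 and world_population")
    return participants / world_population


# Hypothetical figures: 1 and 100 participants per 100,000 bearers of the surname.
low = penetration(1, 100_000)
high = penetration(100, 100_000)
print(low, high)
```

Both results fall far below 1, which is the sense in which penetration is called "very low" here.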

### Actual Data

At the moment, actual data consists of, essentially, one project. The author knows of no other projects with data for comparison. It is, of course, foolhardy to generalize from a sample of one.

Other projects are invited to contribute data for comparison.

### Singletons

It is sometimes assumed (unconsciously?) that singletons are artifacts of either NPEs or insufficient sampling ("penetration"). Let me propose, as an alternate view, the "Last Mohican" perspective that some project participants may represent the last surviving male of their patrilines. If true, singletons are essential for a project to have completely surveyed its surname's Y-DNA.

Update Nov. 2016: The above hypothesis holds true only for project members with direct paternal ancestry of the surname. It does not hold for those who claim maternal relationship to the surname.

## Symbols & Notation

### Review:

#### F = G + S

Where F = found lines (an integer), G = number of groups (not total membership of groups, also an integer), S = number of singletons (an integer).
Definitions: A group consists of two or more participants with matching Y-DNA, according to the project's match rules. A singleton is a participant with no matches found within the project.
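The definitions above can be sketched in a few lines of Python. This is only an illustration: the input format (one match label per participant, with identical labels meaning "these participants match under the project's rules") and the function name `found_lines` are assumptions, not something the project itself prescribes.

```python
from collections import Counter


def found_lines(match_labels):
    """Return (G, S, F) for a list holding one match label per participant."""
    sizes = Counter(match_labels)
    # A group is two or more participants sharing a label.
    groups = sum(1 for size in sizes.values() if size >= 2)
    # A singleton is a participant whose label appears only once.
    singletons = sum(1 for size in sizes.values() if size == 1)
    return groups, singletons, groups + singletons


# Hypothetical project of N = 7 participants.
labels = ["a", "a", "b", "c", "c", "c", "d"]
G, S, F = found_lines(labels)
print(G, S, F)  # 2 2 4
```

Note that G counts groups, not group members: labels "a" and "c" contribute two groups even though five participants belong to them.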

### New Symbols

• N = Total project participants (an integer)
• P = Project penetration relative to world's total target population, a fraction. 0 <= P <= 1.
• Δ = finite change in a variable; ΔS is a finite change in S. By definitions of N, S & G, this is an integer.
• ΔS/ΔN = rate of (finite) change in S with respect to change in N.
• d = an infinitesimal change in a variable, dS is a tiny (not necessarily integer) change in S.
• dS/dN  = "instantaneous rate of change in S with respect to change in N".
• This is, in analytic geometry, the slope of a function's curve; the shape of the curve can be obtained by integration, e.g., ∫(dS/dN)dN


## Limits of G & S

S & G are inversely related. For equal project size, if G (through some magical process) is to increase, S must decrease. And if (through another magical process) S is to increase, G must decrease.

To remove the "magical process",  we might imagine different projects with the same target population, but different samplings.

Let’s call the project size N, for number of participants; a change in project size is ΔN; and average Group size is GS. There are three hypothetical scenarios illustrating the limits of S & G.

• The “no one matches anyone” scenario: All participants are singletons; there are no groups.
S will have an absolute maximum (upper limit) of N, in which case G=0.
Lim(S, G→0) = N

• The “everyone matches someone” scenario: There are no singletons; all participants belong to groups.
G will have an upper limit of N/2  because the minimum size group is 2.
Lim(G, S→0) = N/2. (This is a case in which GS=2.)
• In the real world, group sizes are >=2; average group size is GS>2.
• Diana Gale Mathieson has said (paraphrased) that a project’s DNA survey phase can’t be considered complete until S=0 & minimum group size >=3. (I question the statement’s truth and its basis.)
• The “everyone matches everyone” scenario: There is one group, to which all participants belong; there are no singletons. This may apply to single-source (usually, rare) surnames.
The lower limit of G = 1; lower limit of S = 0.

• Upper & lower limits summary

| Upper limits | Lower limits |
| --- | --- |
| Lim(S, G→0) = N | Lim(S, 1<=G<=N/2) = 0 |
| Lim(G, S→0) = N/2 | Lim(G, 0<=S<=N) = 0 |

In the real world, N is not fixed but, mostly & hopefully, increases over time. For ΔN,

• Most probable ΔS ≈ (1-F/A)*ΔN, dS/dN → (1-F/A), where A is the number of ancestral lines, and
• Most probable ΔG ≈ F/A*(ΔN/GS), dG/dN → F/A/GS
• Because F=G+S, we have G & S on both sides of the derivatives. This problem could be sorted out through algebraic manipulation.
• Most probable ΔS ≈ [1-(G+S)/A]*ΔN = ΔN - (G/A)*ΔN - (S/A)*ΔN, so ΔS + (S/A)*ΔN ≈ ΔN*(1-G/A)
• Most probable ΔG ≈ [(G+S)/A]*(ΔN/GS)
• However, describing curve slopes in probabilistic terms doesn't warrant further refinement; “most probable” is too often a crapshoot.
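For readers who want to plug numbers into the two "most probable" estimates above, here is a minimal sketch. It only evaluates ΔS ≈ (1-F/A)*ΔN and ΔG ≈ F/A*(ΔN/GS) as written; the function name, parameter names, and sample values are assumptions for illustration, and the results carry the same "crapshoot" caveat as the formulas themselves.

```python
def most_probable_increments(found, ancestral, delta_n, avg_group_size):
    """Return (delta_S, delta_G) for delta_n new participants.

    found = F (found lines), ancestral = A (ancestral lines),
    avg_group_size = GS (average group size, at least 2).
    """
    if ancestral <= 0:
        raise ValueError("ancestral (A) must be positive")
    if not 0 <= found <= ancestral:
        raise ValueError("found (F) must lie between 0 and A")
    if avg_group_size < 2:
        raise ValueError("avg_group_size (GS) must be at least 2")
    ratio = found / ancestral                        # F/A
    delta_s = (1 - ratio) * delta_n                  # new singletons expected
    delta_g = ratio * delta_n / avg_group_size       # new groups expected
    return delta_s, delta_g


# Hypothetical values: F = 40 found lines out of A = 100, 10 new participants, GS = 2.5.
delta_s, delta_g = most_probable_increments(40, 100, 10, 2.5)
print(round(delta_s, 3), round(delta_g, 3))
```

Because F changes as soon as a new line is found, the estimate is best read one small ΔN at a time rather than over a long stretch of project growth.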

At the extremes of F/A,  (0 <=  F/A <= 1):

• At F/A=0, ΔS/ΔN = (1-F/A) = 1, ΔG/ΔN = F/A = 0.
(This would seem to be an impossible case; it supposes a project with no participants, i.e., S=0, G=0.)
• A more realistic limit is at N=1; S=1, G=0, F/A=1/A.  ΔS/ΔN ≈ (1-1/A), ΔG/ΔN ≈ 1/A.
• Assuming A=10, dG/dN ≈ 1/10 ≈  10%, dS/dN ≈ (1-1/10) ≈ 90%
• Assuming A=100, dG/dN ≈ 1/100 ≈  1%, dS/dN ≈  (1- 1/100) ≈ 99%.

• At F/A=1, ΔS/ΔN=(1-F/A)=0, ΔG/ΔN=F/A=1.

#### At low values of N:

• N=1: G=0, S=1
• N=2: 0<=G<=1, 0<=S<=2
• N=3: 0<=G<=1, 0<=S<=3
• N=4: 0<=G<=2, 0<=S<=4
• N=5: 0<=G<=2, 0<=S<=5
• N=6: 0<=G<=3, 0<=S<=6
• N=7: 0<=G<=3, 0<=S<=7
• N=8: 0<=G<=4, 0<=S<=8
• N=9: 0<=G<=4, 0<=S<=9
• N=10: 0<=G<=5, 0<=S<=10

Gmax follows this law: For even numbers Gmax = N/2 and for odd numbers Gmax = (N-1)/2.
And, Smax = N.
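The Gmax and Smax rules can be checked by brute force for small N. The sketch below enumerates every (G, S) combination allowed by the definitions (each group has at least 2 members; everyone else is a singleton). The function name `feasible_pairs` is an assumption for illustration.

```python
def feasible_pairs(n):
    """Return the set of (G, S) pairs possible for n participants."""
    pairs = set()
    if n >= 1:
        pairs.add((0, n))  # nobody matches anybody: all singletons
    for g in range(1, n // 2 + 1):
        # g groups use at least 2*g participants; each leftover participant is
        # either a singleton or an extra member of an existing group.
        for s in range(0, n - 2 * g + 1):
            pairs.add((g, s))
    return pairs


for n in range(1, 11):
    pairs = feasible_pairs(n)
    g_max = max(g for g, _ in pairs)
    s_max = max(s for _, s in pairs)
    print(f"N={n}: Gmax={g_max}, Smax={s_max}")
```

In Python terms the even/odd rule collapses to `n // 2`. One side effect of the definitions shows up in the enumeration: S can never equal N-1, because a lone non-singleton would have nobody to match.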

## Rates of change between S & G

From the definition of a group as >=2:

• Minimum ΔS/ΔG = -1 (Singleton joins a prior group.)

## Projects with differing target populations

For projects of different target populations, is there any reason why there should be any S-G relationships between one project and another? Should Adair Y-DNA behave the same as Baker or Cruse?

## High penetration levels

Until now, we have assumed low penetration levels, which fit with present experience: that is, P << 1, in a range of approximately 10^-5 < P < 10^-3. Within this range, the population being sampled is statistically "infinite" relative to sample size.

As N → W (for world population)  -- i.e., P→ 1 -- we might expect G & S to behave differently than at present low penetration levels. This supposes participation in the thousands for rare surnames, in the hundreds of thousands or millions for common surnames.

### At P=1 or P≈1, what is the relationship between N, S & G?

• as N → W & P→ 1, the ratio F/A → 1.
• F (=G+S) will have reached its maximum (upper limit), i.e., F=Fmax & ΔF/ΔN=0, for the reason that the population to be sampled is exhausted.
• G will have reached its maximum;  G=Gmax & ΔG/ΔN=0, for the same reason.
• S will consist only (or, at P<1, primarily) of those whose lines have no other surviving descendants.
S will have reached its maximum,  S=Smax & dS → 0

### At 0.5 <= P <= 0.9, what is the relationship between N, S & G?

This is a sampling-without-replacement problem in which the sample is large in relation to the population being sampled. Binomial probabilities apply.

Imagine a bucket of balls of different colors: The number originally in the bucket is W, the number of colors is A (number of ancestral lines), and the number of balls already picked (selected & examined) is N. What are the chances that a ball picked at random will match one or more of the balls previously picked?
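The bucket question can be explored with a small Monte Carlo sketch. This is an illustration only: the bucket contents, the function name `match_probability`, and the choice of equal-sized colors in the example are all assumptions, not project data. Each trial draws N balls plus one more without replacement and checks whether the last ball's color already appeared among the first N.

```python
import random


def match_probability(bucket, already_picked, trials=10_000, seed=0):
    """Estimate the chance that the next ball matches a previously picked color.

    bucket is a list with one color label per ball (length W).
    """
    if not 0 <= already_picked < len(bucket):
        raise ValueError("already_picked must be smaller than the bucket size")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # sample() draws by position, so this is sampling without replacement.
        draw = rng.sample(bucket, already_picked + 1)
        previous, new_ball = draw[:-1], draw[-1]
        if new_ball in previous:
            hits += 1
    return hits / trials


# Hypothetical bucket: W = 100 balls, A = 20 colors, 5 balls of each color.
bucket = [color for color in range(20) for _ in range(5)]
for n in (10, 50, 90):
    print(n, match_probability(bucket, n))
```

Real surnames will not have equal-sized lines, so the bucket list would need to reflect whatever line sizes are assumed; the estimate is only as good as that assumption.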

With success defined as matching one of the existing groups or singletons -- thus creating a new group --
Pr(k=x), for exactly "k" successes (matches) in "n" trials (new participants):

#### Pr(k) = n!/[k!*(n-k)!] * (F/A)^k * (1-F/A)^(n-k)

Let k = ΔN = 1.
As n-k → 0, (n-k)! → 1
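The binomial formula above can be evaluated directly with Python's standard library. This is a minimal sketch: the function name `binomial_pmf` and the sample values of F and A are assumptions for illustration, with p standing for F/A.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Pr(k) = n!/[k!*(n-k)!] * p^k * (1-p)^(n-k), with p = F/A."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical values: F = 30 found lines out of A = 100 ancestral lines.
p = 30 / 100
print(binomial_pmf(1, 1, p))   # one new participant, one match
print(binomial_pmf(2, 5, p))   # exactly two matches among five new participants
```

With n = k = 1 the combinatorial factor and the (1-F/A) term both reduce to 1, so the probability of a match for a single new participant is simply F/A.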