This page is about possible relationships between S (number of singletons)
and G (number of groups) -- both variables being components of F (found
lines). This is a further exploration, to see what relationships there might be.
James Irvine (focusing on Y-DNA surname projects) defines a project's
penetration as number of participants relative to world population and
assigns the designator P to it.
Penetration for most projects is very low, on the order of 1 to 100 per
100,000 bearing the surname. Low penetration may keep any relationship effects from
appearing in actual project data.
Actual data, at the moment, consists of -- essentially -- one project. The author knows of no other projects with
data for comparison. It is, of course, foolhardy to generalize from a sample of one.
Other projects are invited to contribute data for comparison.
It is sometimes assumed (unconsciously?) that singletons are artifacts of
either NPEs or insufficient sampling ("penetration"). Let me propose, as an alternative view,
the "Last Mohican" perspective that some project participants may represent
the last surviving male of their patrilines. If true, singletons are essential for a
project to have completely surveyed its surname's Y-DNA.
Update Nov. 2016: The above hypothesis holds true only for
project members with direct paternal ancestry of the surname. It does
not hold for those who claim maternal relationship to the surname.
Symbols & Notation
F = G + S
Where F = found lines (an integer), G = number of groups (an integer; not the total
membership of groups), S = number of singletons (an integer). Definitions: A group consists of two or more participants with
matching Y-DNA, according to the project's match rules. A singleton is a
participant with no matches found within the project.
N = Total project participants (an integer)
P = Project penetration relative to world's total target population,
a fraction. 0 <= P <= 1.
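As an illustration of these definitions, here is a minimal Python sketch that tallies N, G, S, F and P. It assumes each participant has already been reduced to a match label by the project's match rules. The function name, the labels and the world-population figure are hypothetical, not project data.

```python
from collections import Counter

def tally(labels, world_population=None):
    """Return N, G, S and F (plus P when a world population is supplied)."""
    sizes = Counter(labels)                        # participants per match label
    n = len(labels)                                # N: total project participants
    g = sum(1 for c in sizes.values() if c >= 2)   # G: labels shared by 2 or more
    s = sum(1 for c in sizes.values() if c == 1)   # S: labels held by one person
    result = {"N": n, "G": g, "S": s, "F": g + s}  # F = G + S
    if world_population:
        result["P"] = n / world_population         # P: penetration, a fraction
    return result

# Invented labels: two groups ("a" and "c") and two singletons ("b" and "d").
example = tally(["a", "a", "b", "c", "c", "c", "d"], world_population=70_000)
print(example)
```

Note that G counts groups, not group members: the three participants labeled "c" add one to G.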
Δ = finite change in a variable; ΔS is a finite change in S. By
definitions of N, S & G, this is an integer.
ΔS/ΔN = rate of (finite) change in S with respect to change in N.
d = an infinitesimal change in a variable, dS is a tiny (not
necessarily integer) change in S.
dS/dN = "instantaneous rate of change in S with respect
to change in N".
This is, in analytic geometry, the slope of a function's curve;
the shape of the curve can be obtained by integration, e.g.,
Limits of G & S
S & G are
inversely related. For equal project size, if G (through some magical
process) is to increase, S must decrease. And if (through another
magical process) S is to increase, G must decrease.
To remove the "magical process", we might imagine
different projects with the same target population, but different
sizes. Denote the project size by N, for number of participants; a change in project size by
ΔN; and average group size by GS. There are three hypothetical scenarios
illustrating the limits of S & G.
The “no one matches anyone” scenario: All participants are singletons;
there are no groups.
S will have an absolute maximum (upper
limit) of N, in which case G=0.
The “everyone matches someone” scenario: There are no singletons; all
participants belong to groups.
G will have an upper limit of N/2
because the minimum size group is 2.
Lim(G, S→0) = N/2.
(This is a case in which GS=2.)
In the real world, group sizes are >=2; average group size is GS>2.
Diana Gale Mathieson has said (paraphrased) that a project’s DNA survey phase
can’t be considered complete until S=0 & minimum group size >=3. (I
question the statement’s truth and its basis.)
The “everyone matches everyone” scenario: There is one group, to
which all participants belong; there are no singletons. This may apply to
single-source (usually, rare) surnames.
The lower limit of G = 1; lower limit of S = 0.
Upper & lower limits summary
Lim(G, 0<=S<=N )=0
In the real world, N is not fixed but, mostly & hopefully, increases over time. For ΔN,
Most probable ΔS ≈ (1-F/A)*ΔN, dS/dN → (1-F/A) and
Most probable ΔG ≈ F/A*(ΔN/GS), dG/dN → F/A/GS
Because F=G+S, we have G & S on both sides of the derivatives. This problem could be
sorted out through algebraic manipulation.
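The two "most probable" expressions can be evaluated directly. The sketch below does only that. It takes A to be the number of ancestral lines (as the page defines it further on) and treats it as known, which in practice it is not. The function name and the input values are hypothetical.

```python
def expected_changes(f, a, gs, delta_n):
    """Most probable (ΔS, ΔG) for ΔN new participants, per the text's formulas."""
    found_fraction = f / a                    # F/A: share of lines already found
    delta_s = (1 - found_fraction) * delta_n  # ΔS ≈ (1 - F/A) * ΔN
    delta_g = found_fraction * delta_n / gs   # ΔG ≈ F/A * (ΔN / GS)
    return delta_s, delta_g

# Invented inputs: half the lines found, average group size 5, ten new joiners.
delta_s, delta_g = expected_changes(f=50, a=100, gs=5, delta_n=10)
print(delta_s, delta_g)
```

Because F itself grows as S and G grow, these values hold only for a small ΔN. Over a long run the inputs would have to be updated step by step.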
Gmax follows this law: For even numbers Gmax = N/2 and for odd numbers Gmax = (N-1)/2.
And, Smax = N.
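These limits can be checked by brute force for small projects. The sketch below lists every way N participants can be split into lines, counts groups and singletons for each split, and confirms the maxima. The function names are hypothetical, and the approach is practical only for small N.

```python
def partitions(n, smallest=1):
    """Yield every split of n participants into line sizes, in non-decreasing order."""
    if n == 0:
        yield []
        return
    for first in range(smallest, n + 1):
        for rest in partitions(n - first, first):
            yield [first] + rest

def limits(n):
    """Return (Gmax, Smax) over all possible splits of n participants."""
    g_values, s_values = [], []
    for sizes in partitions(n):
        g_values.append(sum(1 for size in sizes if size >= 2))  # groups
        s_values.append(sum(1 for size in sizes if size == 1))  # singletons
    return max(g_values), max(s_values)

for n in range(2, 9):
    g_max, s_max = limits(n)
    # Integer division covers both cases: N/2 for even N, (N-1)/2 for odd N.
    assert g_max == n // 2 and s_max == n
```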
Rates of change between S & G
From the definition of a group as >=2:
Minimum ΔS/ΔG = -1 (Singleton joins a prior
Projects with differing target populations
For projects of different target populations, is there any reason why there should be any S-G
relationships between one project and another? Should Adair Y-DNA behave
the same as Baker or Cruse?
High penetration levels
Until now, we have assumed low penetration levels which fit with present
experience, that is P << 1, in a range of approximately
10^-5 < P < 10^-3.
Within this range, the population being sampled is statistically "infinite"
relative to sample size.
As N → W (for world population) -- i.e., P→ 1 -- we might
expect G & S to behave differently than at present low penetration
levels. This supposes participation in the thousands for rare surnames, in the
hundreds of thousands or millions for common surnames.
At P=1 or P≈1, what is the relationship between N, S & G?
As N → W & P → 1, the ratio F/A → 1.
F (=G+S) will have reached its maximum (upper limit), i.e., F=Fmax
& ΔF/ΔN=0, because the population to be sampled
is exhausted.
G will have reached its maximum; G=Gmax
& ΔG/ΔN=0, for the same reason.
S will consist only (or, at P<1, primarily) of those
whose lines have no other surviving descendants.
S will have reached its maximum, S=Smax & dS → 0
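A small simulation can illustrate this behavior as P approaches 1. It is a sketch under invented assumptions: the line sizes are made up, every living male is equally likely to join, and matching is perfect. None of it is project data.

```python
import random
from collections import Counter

def survey(population, n, rng):
    """Sample n participants without replacement; return (F, G, S)."""
    sizes = Counter(rng.sample(population, n))    # participants per line
    g = sum(1 for c in sizes.values() if c >= 2)  # groups
    s = sum(1 for c in sizes.values() if c == 1)  # singletons
    return g + s, g, s

# Hypothetical population: 8 ancestral lines, 3 of them with one survivor.
line_sizes = [1, 1, 1, 2, 5, 10, 30, 50]
population = [line for line, size in enumerate(line_sizes) for _ in range(size)]
w, a = len(population), len(line_sizes)           # W = 100, A = 8

rng = random.Random(0)
for n in (10, 50, 90, w):
    f, g, s = survey(population, n, rng)
    print(f"P={n / w:.2f}  F/A={f / a:.2f}  G={g}  S={s}")
```

At N = W the sample is the whole population, so F = A = 8, G = 5 and S = 3. The remaining singletons are exactly the three one-survivor lines, in keeping with the "Last Mohican" view.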
At 0.5 <= P <= 0.9, what is the relationship between N, S & G?
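This is a sampling-without-replacement problem in which the sample is large
in relation to the population being sampled. Binomial probabilities apply only
approximately here; strictly, sampling without replacement follows the
hypergeometric distribution.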
Imagine a bucket of balls of different colors: The number originally in the
bucket is W, the number of colors is A (number of ancestral lines), and the
number of balls already picked (selected & examined) is N. What are the chances
that a ball picked at random will match one or more of the balls previously picked?
With success defined as matching one of the existing groups or singletons --
thus creating a new group --
Pr(k=x), for exactly "k" successes (matches) in "n"
trials (new participants):
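The standard binomial expression for this probability is Pr(k) = C(n,k) * p^k * (1-p)^(n-k). The sketch below evaluates it. Here p, the chance that any one new participant matches, is a symbol assumed for illustration and is held constant, which is only an approximation when the bucket is being emptied. The function name and the example numbers are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes (matches) in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Invented example: 10 new participants, each assumed to match with p = 0.3.
n_trials, p_match = 10, 0.3
for k in range(n_trials + 1):
    print(k, round(binomial_pmf(k, n_trials, p_match), 4))
```

The probabilities over k = 0..n sum to 1, which is a quick check on any chosen p.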