Measuring Haplotype Rarity

Thoughts on how to set plain-language categorizations of continuous quantitative data.

Also, some thoughts on mathematical models of the observed distributions.

Scale-setting

A plain-language interpretation of any measurement is valuable: it translates abstract numbers into easily understood categories and puts the numbers into context. However, there must be standards; the interpretations should be grounded in reality and reflect it.

The five-category scale (very common, common, average, uncommon and rare) proposed by Casey is good, but we would recommend these standards:

Number of Categories

The minimum number of classifications is two. However, this binary (yes/no) scale is not especially descriptive of continuous data. More categories are desirable.

On the other hand, the number of categories should be kept small, perhaps no more than seven. The fewer the categories, the easier the scale is for a user to follow and remember.

An odd number of categories (3, 5, 7, etc.) has the advantage of providing a middle range. For an even number (2, 4, 6, 8), the middle range must be split between two categories.

Distributions

The scale-setter must also decide which sort of distribution to model and how to model it.


For example, we chose a model like the diagram on the right, primarily on the basis of the percentages to include in each category. If the data were normally distributed (a dubious proposition), the middle category would include scores within ±0.67σ of the median (±25%, or 50% of scores in all). The categories immediately left and right (20% each) would run from 0.67σ to 1.65σ. The outermost categories (5% each) would include scores beyond 1.65σ.
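As a check on the arithmetic, the σ boundaries for a 5% / 20% / 50% / 20% / 5% split can be recovered from the inverse normal CDF. The following is a minimal sketch that assumes a standard normal distribution (which the text itself regards as doubtful for these data); the names `CUMULATIVE`, `CUTS` and `band` are illustrative only.

```python
from bisect import bisect_right
from statistics import NormalDist

# Cumulative shares at the four boundaries of a 5% / 20% / 50% / 20% / 5% split.
CUMULATIVE = [0.05, 0.25, 0.75, 0.95]

# Boundaries in units of sigma, ASSUMING a standard normal distribution.
CUTS = [NormalDist().inv_cdf(p) for p in CUMULATIVE]

def band(z):
    """Return the band index (0 = leftmost, 4 = rightmost) for a z-score."""
    return bisect_right(CUTS, z)

print([round(c, 2) for c in CUTS])  # [-1.64, -0.67, 0.67, 1.64]
```

The band index is deliberately left unlabeled: which end of the scale counts as "rare" depends on the measure being categorized.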


What if we'd chosen a model like the one on the left? It has the middle category within ±1σ of the median and (assuming normality) including 68% of scores. The categories immediately left and right span from 1σ to 1.5σ and include 9% each. The outermost categories (7% each) include scores beyond 1.5σ.
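Going the other way, from σ boundaries to percentages, uses the normal CDF. This sketch again assumes normality; `band_shares` is a hypothetical helper name, not part of any existing tool.

```python
from statistics import NormalDist

def band_shares(cuts):
    """Share of a standard normal falling in each band defined by sigma cut points."""
    cdf = NormalDist().cdf
    # Cumulative probability at each edge, padded with 0 and 1 for the two tails.
    edges = [0.0] + [cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

shares = band_shares([-1.5, -1.0, 1.0, 1.5])
print([round(100 * s) for s in shares])  # [7, 9, 68, 9, 7]
```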

 


Or, we might have chosen a quintile-based model, with each of the five categories including 20% of scores. The problem with this model is that it treats scores as if they were evenly distributed; they are not.
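A small sketch may make the quintile objection concrete. The scores below are made up for illustration (they are not real haplotype data) and are deliberately right-skewed; the cut points come from the standard library's `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical right-skewed scores, for illustration only (not real haplotype data).
scores = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 9, 11, 14, 18, 23, 30, 41, 60]

# Four cut points that split the sample into five equal-count groups.
cuts = quantiles(scores, n=5)

# Widths of the three interior bands, in score units.
widths = [b - a for a, b in zip(cuts, cuts[1:])]

print(cuts)
print(widths)
```

With skewed data, each band holds the same number of scores but the bands differ greatly in width, so equal-count categories do not correspond to equal steps in the underlying measure.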

In short, model choice represents a compromise between the mathematically "ideal" and clear portrayal.

What type of distribution?

Visual inspection of the curves suggests that the data distributions most closely resemble gamma distributions (of which the chi-square distribution is a special case).
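The relationship between the two families can be written down directly: a chi-square distribution with k degrees of freedom is a gamma distribution with shape k/2 and scale 2. The sketch below shows that identity, plus a method-of-moments fit as one simple way to go beyond visual inspection. The fit is an illustrative technique chosen here, not the method used in the text, and all function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale, for x > 0."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom: a gamma with shape k/2, scale 2."""
    return gamma_pdf(x, k / 2, 2.0)

def fit_gamma_moments(data):
    """Method-of-moments estimates (shape, scale) from a sample of positive values."""
    mean = fmean(data)
    var = pvariance(data)
    # For a gamma distribution: mean = shape * scale, variance = shape * scale**2.
    return mean * mean / var, var / mean
```

For example, `fit_gamma_moments(scores)` on a list of positive scores returns a (shape, scale) pair whose product equals the sample mean; the fitted `gamma_pdf` can then be plotted against a histogram of the data.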

Here, for example, are graphs of the WApM and χ² distributions:

[Graphs: WApM; χ²]