# Measuring Haplotype Rarity

Thoughts on how to set plain-language categorizations of continuous quantitative data.

Also, some thoughts on mathematical models of the observed distributions.

## Scale-setting

A plain-language interpretation of any measurement is valuable; it translates abstract numbers into easily understood categories, putting the numbers into context. However, there must be standards; the interpretations should be grounded in, and reflect, reality.

The five-category scale (very common, common, average, uncommon & rare) proposed by Casey is good but we would recommend these standards:

• One point on the scale must be defined first, so that the remaining points can be assigned relative to it. We recommend that this anchor be the mid-point of all scores, which then becomes the middle of the “average” category.

• An "average" (or "medium" or "moderate") category represents the middle range of observed measurements and should include those haplotypes at or near a measure of central tendency (mean, median or mode). It need not be the same size as the other categories and may be broader. (We propose the middle 50% of haplotypes.)

• The median score (for which half of scores are lower and half higher) appears to be the best measure of central tendency for the mid-point of the “average” category. It serves better than either the mean (average) or the mode (most frequent score):
  • The mean can be unduly affected by very high scores.
  • The mode may not represent a central tendency (as in skewed distributions) and may not be unique; some distributions have more than one mode.

• As one progresses outward toward the ends of the scale, categories should include smaller percentages of haplotypes, so that the scale discriminates better at the extremes. (We propose 20% each for common and uncommon and 5% each for very common and rare.)

• A hypothesized scale should be tested on a sufficiently broad and diverse sample and revised until it fits the observed data.
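
The proposed 5/20/50/20/5 split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the scores are made-up placeholder values rather than real haplotype data, and the code assumes that a higher score means a rarer haplotype (reverse the labels if the score runs the other way). The cut points sit at the 5th, 25th, 75th and 95th percentiles, so the median falls in the middle of the “average” category.

```python
from statistics import quantiles

# Cumulative percentiles implied by the proposed 5/20/50/20/5 split.
CUT_PERCENTILES = (5, 25, 75, 95)
# ASSUMPTION: low scores are the most common haplotypes, high scores the rarest.
LABELS = ("very common", "common", "average", "uncommon", "rare")


def scale_cutoffs(scores):
    """Return the four score thresholds that separate the five categories."""
    # quantiles(n=100) returns the 1st through 99th percentiles (99 values).
    percentiles = quantiles(scores, n=100, method="inclusive")
    return [percentiles[p - 1] for p in CUT_PERCENTILES]


def categorize(score, cutoffs):
    """Map a single score to its plain-language category."""
    for label, cut in zip(LABELS, cutoffs):
        if score <= cut:
            return label
    return LABELS[-1]


# Placeholder data only: the integers 1..100 stand in for observed scores.
scores = list(range(1, 101))
cutoffs = scale_cutoffs(scores)

counts = {label: 0 for label in LABELS}
for s in scores:
    counts[categorize(s, cutoffs)] += 1

print("cut-offs:", cutoffs)
print("haplotypes per category:", counts)
```

Because the thresholds are taken from percentiles of the observed scores rather than from a fitted curve, the same sketch can be re-run on a broader sample to check whether the hypothesized scale still fits.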

### Number of Categories

The minimum number of classifications is two. However, this binary (yes/no) scale is not especially descriptive of continuous data. More categories are desirable.

On the other hand, the number of categories should be kept small, perhaps no more than seven. The fewer the categories, the easier the scale is for a user to follow and remember.

An odd number of categories (3, 5, 7, etc.) has the advantage of providing a middle range. For an even number (2, 4, 6, 8), the middle range must be split between two categories.

### Distributions

The scale-setter must also decide which sort of distribution to model and how to model it. For example, we chose a model like the diagram on the right, primarily on the basis of the percentages to include in each category. If the data were normally distributed (a dubious proposition), the middle category would include scores within about ±0.674σ of the median (25% of scores on either side). The categories immediately to the left and right (20% each) would run from about 0.674σ to 1.645σ, and the outermost categories (5% each) would lie beyond 1.645σ.

What if we had chosen a model like the one on the left? It places the middle category within ±1σ of the median and (assuming normality) includes 68% of scores. The categories immediately to the left and right run from 1σ to 1.5σ and include 9% each. The outermost categories (7% each) lie beyond 1.5σ.

Or, we might have chosen a quintile-based model, with each of the five categories including 20% of scores. The problem with this model is the assumption that scores are evenly distributed; they are not.
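
The figures for the two normal-based models can be checked with the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the variable names are illustrative, and normality itself remains the assumption questioned above. The first part fixes the percentages and derives the σ cut-offs; the second fixes the σ cut-offs and derives the percentages.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Model 1: fix the percentages (5/20/50/20/5) and derive the sigma cut-offs.
inner_z = std_normal.inv_cdf(0.75)  # upper edge of the middle 50%
outer_z = std_normal.inv_cdf(0.95)  # upper edge of the adjacent 20%
print(f"middle 50% lies within +/-{inner_z:.3f} sigma")
print(f"outer 5% lies beyond {outer_z:.3f} sigma")

# Model 2: fix the sigma cut-offs (1 and 1.5) and derive the percentages.
middle = std_normal.cdf(1.0) - std_normal.cdf(-1.0)
flank = std_normal.cdf(1.5) - std_normal.cdf(1.0)
tail = 1.0 - std_normal.cdf(1.5)
print(f"middle {middle:.0%}, flanks {flank:.0%} each, tails {tail:.0%} each")
```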

In short, model choice represents a compromise between the mathematically "ideal" and a clear portrayal.

## What type of distribution?

Visual inspection of the curves suggests that the data distributions most closely resemble gamma distributions (of which the chi-square distribution is a special case).

Here, for example, are graphs of the WApM and χ² distributions:

[Graphs: WApM; χ²]
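
The relationship between the two families can be made concrete: a chi-square distribution with k degrees of freedom is a gamma distribution with shape k/2 and scale 2. The sketch below writes out both densities using only the standard library and compares them at a few points. The function names and the choice k = 4 are illustrative assumptions and are not fitted to any WApM data.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    return (x ** (shape - 1) * math.exp(-x / scale)) / (
        math.gamma(shape) * scale ** shape
    )


def chi2_pdf(x, k):
    """Density of a chi-square distribution with k degrees of freedom."""
    return (x ** (k / 2 - 1) * math.exp(-x / 2)) / (
        2 ** (k / 2) * math.gamma(k / 2)
    )


k = 4  # illustrative degrees of freedom only
for x in (0.5, 2.0, 6.0):
    g = gamma_pdf(x, shape=k / 2, scale=2.0)
    c = chi2_pdf(x, k)
    print(f"x={x}: gamma={g:.6f}  chi-square={c:.6f}")
```

For k = 4 the mode of this density is at k − 2 = 2 while the mean is k = 4, which is the kind of right skew that makes the median a safer anchor than either the mean or the mode.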