Normal Distribution & Standard Deviation
Community Medicine · Biostatistics · lean revision notes
The normal (Gaussian) distribution is the backbone of biostatistics and a perennial favourite in NEET PG. Master the 68–95–99.7% empirical rule, the difference between standard deviation (SD) and standard error of the mean (SEM), z-scores, and the central limit theorem (CLT) — these four ideas generate the overwhelming majority of questions.
Definition & basic concepts
A frequency distribution describes how often each value of a variable occurs. When a continuous biological variable (height, weight, haemoglobin, blood pressure, serum cholesterol) is plotted for a large population, it commonly produces a symmetrical, bell-shaped curve: the normal distribution, named after Carl Friedrich Gauss (hence Gaussian curve), although the equation itself is owed to de Moivre and Laplace.
Key descriptive terms:
- Measures of central tendency: mean, median, mode.
- Measures of dispersion (variability): range, variance, standard deviation, coefficient of variation.
- Variance (σ²): the mean of the squared deviations from the mean.
- Standard deviation (σ or SD): the square root of variance — the single most important measure of dispersion; it has the same units as the original data.
High-yield: SD is the most commonly used and the best measure of dispersion because it uses every observation and is mathematically tractable. Variance = SD².
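The definitions above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the haemoglobin values and the variable names are made up for illustration. It also shows the sample versions, which divide by n − 1 rather than n.

```python
import math
import statistics

# Hypothetical haemoglobin values in g/dL (illustrative, not real data)
hb = [12.0, 13.0, 14.0, 15.0, 16.0]

mean_hb = statistics.fmean(hb)

# Population variance: the mean of the squared deviations from the mean
pop_var = sum((x - mean_hb) ** 2 for x in hb) / len(hb)
# SD is the square root of variance, so it is back in g/dL
pop_sd = math.sqrt(pop_var)

# Sample variance and SD divide by (n - 1) instead of n
sample_var = statistics.variance(hb)
sample_sd = statistics.stdev(hb)

print(f"population variance = {pop_var:.2f}, population SD = {pop_sd:.3f}")
print(f"sample variance     = {sample_var:.2f}, sample SD     = {sample_sd:.3f}")
```

The sample SD is slightly larger than the population SD for the same data because of the smaller divisor.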
Properties of the normal curve
The normal distribution has a defined set of properties that are directly examinable:
- It is bell-shaped, smooth and symmetrical about the mean.
- Mean = Median = Mode and they all coincide at the centre (the peak).
- It is unimodal (single peak).
- The curve is asymptotic to the x-axis — the tails extend to ±∞ and never touch the baseline.
- The total area under the curve = 1 (or 100%); area to either side of the mean = 0.5 (50%).
- It is completely described by just two parameters: the mean (µ) determines location and the SD (σ) determines spread/width.
- The curve has points of inflection at µ ± 1σ (where concavity changes).
- Skewness = 0 and kurtosis = 3 (mesokurtic) for a true normal curve.
High-yield: A larger SD produces a flatter, wider curve; a smaller SD produces a taller, narrower curve. The mean fixes where the curve sits; the SD fixes how spread out it is.
| Feature | Normal distribution | Positively skewed | Negatively skewed |
|---|---|---|---|
| Shape | Symmetrical bell | Long tail to right | Long tail to left |
| Relationship | Mean = Median = Mode | Mean > Median > Mode | Mean < Median < Mode |
| Classic example | Height, BP, Hb | Income, serum bilirubin, hospital stay, incubation period | — |
| Best central measure | Mean | Median | Median |
High-yield: In a positively (right) skewed distribution, Mean > Median > Mode. Incubation period of a disease and parity/serum triglyceride values are classic right-skewed examples. For skewed data the median is the preferred measure of central tendency.
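The ordering Mean > Median > Mode can be seen on a small right-skewed dataset. The sketch below uses invented hospital-stay values (in days) with one long stay forming the right tail; the names are illustrative only.

```python
import statistics

# Hypothetical lengths of hospital stay in days; the 20-day stay is the long right tail
stay_days = [1, 2, 2, 2, 3, 3, 4, 5, 8, 20]

mean_stay = statistics.fmean(stay_days)
median_stay = statistics.median(stay_days)
mode_stay = statistics.mode(stay_days)

print(f"mean = {mean_stay}, median = {median_stay}, mode = {mode_stay}")
# The mean is dragged towards the tail, the median much less so
print("right-skewed ordering holds:", mean_stay > median_stay > mode_stay)
```

A single extreme value moves the mean well away from the bulk of the data, which is why the median is preferred for skewed variables.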
The empirical rule (68–95–99.7% rule)
This is the single most tested fact. In any normal distribution, a fixed percentage of observations lies within stated SD limits of the mean:
| Limits (Mean ± z·SD) | % of observations included | % outside (both tails) |
|---|---|---|
| Mean ± 1 SD | 68.27% (~68%) | ~32% |
| Mean ± 2 SD | 95.45% (~95%) | ~5% |
| Mean ± 3 SD | 99.73% (~99.7%) | ~0.3% |
Two precision-related cut-offs are frequently swapped in MCQs:
- Mean ± 1.96 SD → 95% of observations (1.96 is the z value for a central 95%, to two decimals).
- Mean ± 2.58 SD → 99% of observations (more precisely, z = 2.576).
High-yield: "95% of values lie within Mean ± 2 SD" is an approximation; the exact 95% limit is Mean ± 1.96 SD. Examiners exploit the 2 vs 1.96 distinction. Likewise, ±2.58 SD = 99%.
Worked example: If mean fasting glucose = 90 mg/dL and SD = 10 mg/dL, then 95% of people have values between 90 ± (1.96 × 10) ≈ 70.4 to 109.6 mg/dL, and roughly 68% lie between 80 and 100 mg/dL (90 ± 1 SD).
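The percentages in the table and the worked example can be reproduced from the normal curve itself. For a normal distribution, the area within Mean ± z SD equals erf(z/√2); the sketch below uses that identity with Python's standard `math.erf`. The function name `area_within` is an illustrative choice.

```python
import math

def area_within(z):
    """Fraction of a normal distribution lying within mean ± z SD."""
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3, 1.96, 2.58):
    print(f"mean ± {z} SD covers {100 * area_within(z):.2f}%")

# Worked example from the text: fasting glucose, mean 90 mg/dL, SD 10 mg/dL
mean, sd = 90, 10
low, high = mean - 1.96 * sd, mean + 1.96 * sd
print(f"central 95% of values: {low:.1f} to {high:.1f} mg/dL")
```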
Because the curve is symmetrical, you can split the percentages by tail:
- Below Mean − 1 SD ≈ 16%; above Mean + 1 SD ≈ 16% (since 100 − 68 = 32, split equally).
- Below Mean − 2 SD ≈ 2.5%; above Mean + 2 SD ≈ 2.5%.
Stepwise approach to a "% within limits" question:
Identify mean & SD → express the stated limit as Mean ± z·SD → match z to the empirical value (1, 2, 3 or 1.96, 2.58) → read off the percentage → if asked for one tail, halve the outside area.
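The stepwise approach can be sketched in code as a check on hand calculations. This assumes the variable really is normally distributed and uses `statistics.NormalDist` from the standard library; the helper names `percent_within` and `percent_one_tail` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1

def percent_within(mean, sd, lower, upper):
    """Percentage of observations between two limits."""
    z_low = (lower - mean) / sd    # express each limit as a z value
    z_high = (upper - mean) / sd
    return 100 * (STD_NORMAL.cdf(z_high) - STD_NORMAL.cdf(z_low))

def percent_one_tail(z):
    """Percentage beyond mean + z SD (same as below mean - z SD, by symmetry)."""
    outside_both = 100 - 100 * (STD_NORMAL.cdf(z) - STD_NORMAL.cdf(-z))
    return outside_both / 2        # halve the outside area for one tail

print(f"{percent_within(90, 10, 80, 100):.2f}% lie between 80 and 100")
print(f"{percent_one_tail(1):.2f}% lie above mean + 1 SD")
print(f"{percent_one_tail(2):.2f}% lie above mean + 2 SD")
```

The one-tail figures come out slightly below the rounded 16% and 2.5% used for quick recall, because 68% and 95% are themselves rounded.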
Z-score (standard normal deviate)
To compare values from distributions with different means/SDs, we convert raw values into a standard score.
Z = (x − µ) / σ
where x = observation, µ = population mean, σ = SD.
- The standard normal distribution is a normal curve with mean = 0 and SD = 1.
- A z-score states how many SDs an observation lies from the mean, and its sign gives direction (above/below mean).
- z = +2 means the value is 2 SD above the mean (in the top ~2.5%); z = −1.5 means 1.5 SD below the mean.
High-yield: Z-score transformation lets you compare apples to oranges — e.g. whether a child's height (z = +1.2) is "more extreme" than its weight (z = +0.8). The higher absolute z is more deviant. Z-scores underlie WHO anthropometric standards (stunting/wasting defined as below −2 SD or −3 SD of reference median).
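As a small illustration of the height-versus-weight comparison, the sketch below converts two raw measurements to z-scores. The reference means and SDs are invented for the example and are not WHO values.

```python
def z_score(x, mu, sigma):
    """Number of SDs that observation x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical child: height 112 cm, weight 19.6 kg
# Hypothetical reference values: height 106 ± 5 cm, weight 18 ± 2 kg
height_z = z_score(112.0, 106.0, 5.0)
weight_z = z_score(19.6, 18.0, 2.0)

print(f"height z = {height_z:+.1f}, weight z = {weight_z:+.1f}")

# The measurement with the larger absolute z is further from its own mean
more_extreme = "height" if abs(height_z) > abs(weight_z) else "weight"
print("more extreme measurement:", more_extreme)
```

Because both values are now in SD units, centimetres and kilograms can be compared directly.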
| z value | Area beyond (one tail) | Use |
|---|---|---|
| 1.64 / 1.645 | 5% one-tailed | One-sided 95% |
| 1.96 | 2.5% each tail | Two-sided 95% CI |
| 2.58 | 0.5% each tail | Two-sided 99% CI |
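
The critical values in this table can be recovered from the standard normal curve rather than memorised blindly. A minimal sketch using `statistics.NormalDist().inv_cdf` from the Python standard library follows; the variable names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()  # mean 0, SD 1

# inv_cdf(p) returns the z value with area p to its left
z_one_sided_95 = std.inv_cdf(0.95)    # 5% in one tail
z_two_sided_95 = std.inv_cdf(0.975)   # 2.5% in each tail
z_two_sided_99 = std.inv_cdf(0.995)   # 0.5% in each tail

print(f"one-sided 95%: z = {z_one_sided_95:.3f}")
print(f"two-sided 95%: z = {z_two_sided_95:.3f}")
print(f"two-sided 99%: z = {z_two_sided_99:.3f}")
```

The values round to 1.645, 1.96 and 2.58 (2.576 to three decimals), matching the table.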
Standard error of the mean (SEM)
The standard error of the mean measures how much the sample mean is likely to vary from the true population mean — it is the SD of the sampling distribution of means, not of individual observations.
SEM = SD / √n
where SD = sample standard deviation, n = sample size.
The distinction between SD and SEM is one of the most discriminating NEET PG points:
| Aspect | Standard Deviation (SD) | Standard Error of Mean (SEM) |
|---|---|---|
| Measures | Variability of individual observations about the mean | Variability of the sample mean about the population mean |
| Formula | √(Σ(x−x̄)²/(n−1)) | SD / √n |
| Effect of ↑ sample size | Essentially unchanged (estimates population spread) | Decreases (precision improves) |
| Relative size | Always ≥ SEM | Always ≤ SD |
| Used to | Describe spread / define reference ranges | Construct confidence intervals, do significance tests |
| Reporting | Descriptive statistics | Inferential / precision of estimate |
High-yield: As sample size (n) increases, SEM decreases (because of √n in the denominator), so the estimate of the mean becomes more precise and confidence intervals narrow. SD does NOT shrink with larger n. This "effect of increasing sample size" is asked repeatedly.
95% confidence interval for a mean = x̄ ± 1.96 × SEM (large samples). For small samples (n < 30) with σ unknown, the multiplier 1.96 is replaced by the critical value of the t-distribution with n − 1 degrees of freedom. The t-distribution is flatter than z, with heavier tails, so the resulting interval is wider.
High-yield: Quadrupling the sample size halves the SEM (because √4 = 2). To halve the standard error you must multiply n by four, not two.
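The effect of sample size on SEM and on the confidence interval is easy to verify numerically. The sketch below reuses the fasting-glucose figures (mean 90, SD 10) with assumed sample sizes of 100 and 400, and applies the large-sample multiplier 1.96; the function names are illustrative.

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

def ci95(mean, sd, n):
    """Large-sample 95% confidence interval: mean ± 1.96 × SEM."""
    half_width = 1.96 * sem(sd, n)
    return mean - half_width, mean + half_width

sd = 10
for n in (100, 400):
    low, high = ci95(90, sd, n)
    print(f"n = {n}: SEM = {sem(sd, n):.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

Going from n = 100 to n = 400 halves the SEM and therefore halves the width of the interval, while the SD stays at 10 throughout.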
Central limit theorem (CLT)
The CLT is what makes parametric statistics possible:
As sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the underlying population distribution (as long as its variance is finite), provided n is sufficiently large (conventionally n ≥ 30).
Implications:
- Even if the raw data are skewed, the distribution of sample means will be approximately normal for large n. This justifies using z/t tests on means.
- The mean of the sampling distribution equals the population mean (µ).
- The SD of the sampling distribution = SEM = σ/√n.
High-yield: CLT applies to the distribution of sample means, not to individual data. It is the theoretical basis for why we can apply the normal-curve-based tests to large samples even from non-normal populations.
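A short simulation makes the theorem concrete. The sketch below draws repeated samples of n = 30 from an exponential population (strongly right-skewed, mean 1, SD 1) and looks at the distribution of the sample means. The population, the seed and the number of repeats are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_PER_SAMPLE = 30
N_SAMPLES = 5000

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential population (mean 1, SD 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

means = [sample_mean(N_PER_SAMPLE) for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(means)        # should be close to mu = 1
sd_of_means = statistics.stdev(means)          # should be close to sigma / sqrt(n)
expected_sem = 1.0 / math.sqrt(N_PER_SAMPLE)

# Share of sample means falling within mu ± 1.96 SEM (about 95% if roughly normal)
within = sum(abs(m - 1.0) <= 1.96 * expected_sem for m in means) / N_SAMPLES

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means   = {sd_of_means:.3f} (sigma/sqrt(n) = {expected_sem:.3f})")
print(f"within mu ± 1.96 SEM = {100 * within:.1f}%")
```

Even though individual observations are far from normal, the sample means cluster around µ with a spread close to σ/√n.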
Coefficient of variation (CV)
When comparing the variability of two datasets measured in different units or with very different means (e.g. comparing variability of weight in kg vs height in cm), use the relative measure:
CV = (SD / Mean) × 100%
- CV is unitless (a percentage), allowing comparison across datasets.
- Higher CV → greater relative variability.
High-yield: Use CV (not SD) to compare dispersion between two distributions with different units or widely different means.
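A quick numerical illustration: the figures below are invented weight and height summaries, chosen so that the two SDs look similar while the relative variability is very different.

```python
def coefficient_of_variation(sd, mean):
    """Relative variability as a percentage: (SD / mean) × 100."""
    return sd / mean * 100

# Hypothetical summaries: weight 64 ± 8 kg, height 170 ± 8.5 cm
cv_weight = coefficient_of_variation(8.0, 64.0)
cv_height = coefficient_of_variation(8.5, 170.0)

print(f"CV of weight = {cv_weight:.1f}%")
print(f"CV of height = {cv_height:.1f}%")
print("more variable:", "weight" if cv_weight > cv_height else "height")
```

Comparing 8 kg with 8.5 cm directly is meaningless; the unitless CV shows that weight is relatively more variable in this made-up dataset.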
Investigations / applied uses (Community Medicine angle)
The normal curve underpins several public-health tools:
- Reference ("normal") ranges in laboratory medicine are typically defined as Mean ± 2 SD (covering the central ~95% of a healthy population). By definition this labels ~2.5% of healthy people as "abnormal" at each tail.
- Growth charts / WHO Z-score classification: wasting and stunting defined relative to SD below the reference median.
- Screening cut-offs and quality-control limits (Levey–Jennings charts use ±2 SD warning and ±3 SD action limits — Westgard rules).
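Two of these uses reduce to simple arithmetic on the mean and SD. The sketch below computes a Mean ± 2 SD reference range and applies the ±2 SD warning and ±3 SD action limits to control readings. The function names and the haemoglobin figures are hypothetical, and the full Westgard multi-rule logic is deliberately not implemented.

```python
def reference_range(mean, sd, k=2):
    """Reference range as mean ± k SD (k = 2 covers about 95% of healthy values)."""
    return mean - k * sd, mean + k * sd

def control_flag(value, mean, sd):
    """Classify one quality-control reading by its distance from the mean in SDs."""
    z = abs(value - mean) / sd
    if z > 3:
        return "action"      # beyond the ±3 SD action limit
    if z > 2:
        return "warning"     # beyond the ±2 SD warning limit
    return "in control"

# Hypothetical haemoglobin control material: mean 14.0 g/dL, SD 1.0 g/dL
print("reference range:", reference_range(14.0, 1.0))
for reading in (14.5, 16.5, 17.5):
    print(reading, "->", control_flag(reading, 14.0, 1.0))
```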
Complications / common pitfalls (exam traps)
- Confusing SD with SEM: the classic trap. Remember that SEM is the SD divided by √n, so only SEM shrinks as n grows.
- Assuming 2 SD = exactly 95% — it is 95.45%; the exact 95% is 1.96 SD.
- Thinking increasing n changes the SD — it stabilises the estimate of SD but does not systematically shrink it; only SEM shrinks.
- Forgetting that defining a reference range as Mean ± 2 SD inherently mislabels about 5% of normal people as abnormal (about 2.5% at each tail).
- Applying the empirical rule to skewed data — it only holds for an (approximately) normal distribution.
Key differentials (distinguishing related terms)
- SD vs Variance: SD = √Variance; Variance is in squared units, SD in original units.
- SD vs Standard error: dispersion of data vs precision of the mean.
- z-distribution vs t-distribution: z used when σ known / large n; t used for small samples (n < 30) and is flatter with heavier tails (depends on degrees of freedom).
- Normal vs Binomial vs Poisson: Normal = continuous symmetrical; Binomial = two mutually exclusive outcomes (mean = np); Poisson = rare events / counts where mean = variance (e.g. number of deaths, radioactive decay, rare disease counts).
| Distribution | Data type | Defining feature |
|---|---|---|
| Normal (Gaussian) | Continuous | Symmetrical bell; defined by µ and σ |
| Binomial | Discrete, 2 outcomes | Mean = np; "success/failure" |
| Poisson | Discrete, rare events | Mean = Variance; events over time/space |
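
The defining features in the table can be confirmed directly from the probability formulas. The sketch below builds the binomial and Poisson probabilities with the standard `math` module and computes mean and variance from them. The parameters (n = 10, p = 0.3, λ = 2) are arbitrary, and the Poisson sum is truncated at 60 terms, beyond which the probabilities are negligible.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def mean_and_variance(pmf):
    """pmf is a list of (value, probability) pairs."""
    mean = sum(k * prob for k, prob in pmf)
    var = sum((k - mean) ** 2 * prob for k, prob in pmf)
    return mean, var

n, p = 10, 0.3
binom = [(k, binomial_pmf(k, n, p)) for k in range(n + 1)]
lam = 2.0
pois = [(k, poisson_pmf(k, lam)) for k in range(60)]  # truncated; tail is negligible

b_mean, b_var = mean_and_variance(binom)
p_mean, p_var = mean_and_variance(pois)

print(f"binomial: mean = {b_mean:.3f} (np = {n * p:.3f}), variance = {b_var:.3f}")
print(f"poisson:  mean = {p_mean:.3f}, variance = {p_var:.3f}")
```

The binomial mean matches np, and the Poisson mean and variance agree with each other, as the table states.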
Mnemonics & memory aids
- "68–95–99.7" for 1–2–3 SD — the empirical rule; the numbers themselves are the answer.
- "SEM has an n" → SEM = SD/√n, so SEM falls as n rises. SD has no n in this sense.
- Skew direction = where the tail points & where the mean is dragged: "the mean chases the tail" — right tail → mean pulled right → positive skew (Mean > Median > Mode).
- 1.96 ≈ 2 → 95%; 2.58 → 99% ("1.96 for ninety-six… 95").
Recently asked / exam angle
NEET PG and INI-CET have repeatedly tested:
- "What percentage of observations lie within Mean ± 2 SD?" → ~95% (95.45%); the precise 95% limit is ±1.96 SD.
- "Which decreases on increasing sample size — SD or SEM?" → SEM (SD ≈ unchanged).
- Direct formula recall: SEM = SD/√n; Z = (x − µ)/σ; CV = SD/Mean × 100.
- Skew identification: given Mean > Median > Mode → positively (right) skewed; best central measure = median.
- Statement-type: identify which property of the normal curve is TRUE (symmetry, Mean=Median=Mode, area = 1, asymptotic tails).
- Reference range definition = Mean ± 2 SD and the implication that ~5% of healthy people fall "outside normal".
- Central limit theorem: sampling distribution of the mean becomes normal as n grows, regardless of population shape.
- z vs t: small sample (n < 30, σ unknown) → use t-test / t-distribution.
Rapid revision
- Normal curve = symmetrical, bell-shaped, asymptotic; Mean = Median = Mode; total area = 1; defined by µ and σ.
- Empirical rule: Mean ± 1 SD = 68%, ± 2 SD = 95%, ± 3 SD = 99.7%.
- Exact 95% = Mean ± 1.96 SD; exact 99% = Mean ± 2.58 SD.
- SD = √Variance; best measure of dispersion; same units as data.
- SEM = SD / √n — measures precision of the sample mean; falls as n rises; SD does not.
- Z = (x − µ)/σ; standard normal curve has mean 0, SD 1.
- CV = (SD/Mean) × 100% — unitless; compares variability across different units/means.
- Positive (right) skew: Mean > Median > Mode (e.g. incubation period, income); use median.
- Central limit theorem: distribution of sample means → approximately normal once n is large (conventionally n ≥ 30), irrespective of population shape.
- Reference range = Mean ± 2 SD, automatically labelling ~5% of healthy persons abnormal.
- Small sample (n < 30) → use t-distribution (flatter, heavier tails, df = n − 1).
- Poisson: Mean = Variance (rare events); Binomial: Mean = np (two outcomes).