Descriptive Statistics & Data Presentation
Community Medicine · Biostatistics · lean revision notes
Descriptive statistics summarise and organise raw data so that patterns become visible without making inferences beyond the sample. For NEET PG, this is a reliable scoring area in Community Medicine (Biostatistics): expect direct questions on type of data → correct graph, which measure of central tendency to use, and interpretation of skewness and standard deviation.
Definition & scope
Biostatistics is the application of statistical methods to biological and medical data. It has two broad arms:
- Descriptive statistics — describe and condense data (tables, graphs, mean, SD, etc.). No probability/inference involved.
- Inferential statistics — draw conclusions about a population from a sample (hypothesis tests, confidence intervals, p-values).
This topic deals exclusively with the descriptive arm. The workflow is: collect data → classify the variable type → present it (table/graph) → summarise centre and spread → comment on shape (distribution).
High-yield: The single most-tested concept is matching the type of data to the correct graphical method and correct measure of central tendency. Memorise the master table below.
Types of data (variables)
Variables are first split into qualitative (categorical) and quantitative (numerical), and then further classified using the NOIR scale of measurement (Stevens' classification).
| Scale | Type | Nature | Example | Centre measure | Math allowed |
|---|---|---|---|---|---|
| Nominal | Qualitative | Named categories, no order | Blood group, sex, religion | Mode | Counting only |
| Ordinal | Qualitative | Ordered categories, unequal/unknown gaps | Cancer stage, pain score, socio-economic class | Median | Ranking |
| Interval | Quantitative | Ordered, equal intervals, no true zero | Temperature in °C, IQ, calendar dates | Mean/Median | +, − |
| Ratio | Quantitative | Equal intervals, true zero present | Weight, height, pulse, BP, age, income | Mean | +, −, ×, ÷ |
Mnemonic — "NOIR" (French for black): Nominal → Ordinal → Interval → Ratio, in increasing order of information content.
Key distinctions tested:
- Nominal vs Ordinal: both are categorical; ordinal has an inherent order (mild < moderate < severe), nominal does not (A vs B vs O blood group).
- Interval vs Ratio: the deciding test is a true/absolute zero. 0 °C does not mean "no temperature" → interval. 0 kg means "no weight" → ratio. Because interval data lack a true zero, you cannot say 40 °C is twice as hot as 20 °C, but you can say 40 kg is twice 20 kg.
Quantitative data are also split as:
- Discrete — whole-number counts (number of children, number of episodes of diarrhoea). Cannot take fractions.
- Continuous — any value within a range (height, weight, serum sodium). Can take decimals.
High-yield: Temperature in Celsius/Fahrenheit = interval; Temperature in Kelvin = ratio (Kelvin has an absolute zero). This exact trick is frequently asked.
Frequency distribution & tabulation
Raw data are first organised into a frequency distribution table.
- Class interval (CI): the range of each group (e.g., 10–19, 20–29).
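- Class limits vs class boundaries: limits are the stated end values of a class (10 and 19); boundaries extend halfway into the gap between adjacent classes (9.5–19.5) so that continuous data can be plotted without gaps.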
- Cumulative frequency: running total — basis for the ogive and for computing median, quartiles, percentiles.
- Sturges' rule for number of classes: k = 1 + 3.322 log₁₀(n).
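The tabulation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a standard routine: the ages are made-up data, and the helper names `sturges_classes` and `frequency_table` are hypothetical.

```python
import math

def sturges_classes(n: int) -> float:
    """Number of classes suggested by Sturges' rule (round to a whole number in practice)."""
    return 1 + 3.322 * math.log10(n)

def frequency_table(values, lower, width, k):
    """Return one (class interval, frequency, cumulative frequency) row per class."""
    rows, cumulative = [], 0
    for i in range(k):
        lo = lower + i * width
        hi = lo + width - 1  # inclusive upper class limit for whole-number data
        freq = sum(lo <= v <= hi for v in values)
        cumulative += freq  # running total, the basis of the ogive
        rows.append((f"{lo}-{hi}", freq, cumulative))
    return rows

# Hypothetical ages of 16 patients
ages = [12, 15, 18, 21, 22, 25, 27, 29, 31, 34, 35, 38, 41, 44, 47, 49]
table = frequency_table(ages, lower=10, width=10, k=4)
for row in table:
    print(row)
```

The last cumulative frequency always equals n, which is a quick check that no observation was dropped or counted twice.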
Measures of central tendency
These describe the "centre" or typical value.
| Measure | Definition | Best for | Affected by outliers? |
|---|---|---|---|
| Mean | Sum ÷ n (arithmetic average) | Symmetric/normal, ratio/interval data | Yes (very sensitive) |
| Median | Middle value when ordered (50th percentile) | Skewed data, ordinal data | No (robust) |
| Mode | Most frequent value | Nominal data | No |
Calculation reminders:
- Median position = the ((n + 1)/2)th observation in the ordered data. For even n this position falls between two observations, so average the two middle values.
- A dataset can have no mode, one mode (unimodal), two (bimodal) or more.
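- Geometric mean (nth root of the product of n values) is used for ratios, growth rates and titres (e.g., antibody titres after vaccination, serial dilutions) and for highly right-skewed data. Harmonic mean (reciprocal of the mean of the reciprocals) is used for averaging speeds or similar rates, e.g., average speed over equal distances.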
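As a rough illustration of the measures above, Python's standard `statistics` module computes each of them directly. The titre and speed values are hypothetical.

```python
import statistics

# Hypothetical reciprocal antibody titres (1:8, 1:16, ...)
titres = [8, 16, 16, 32, 64]

mean_value = statistics.mean(titres)           # arithmetic mean: 27.2
median_value = statistics.median(titres)       # middle ordered value: 16
mode_value = statistics.mode(titres)           # most frequent value: 16
geometric = statistics.geometric_mean(titres)  # nth root of the product: about 21.1

# Even n: the median is the average of the two middle values
even_median = statistics.median([1, 2, 3, 4])  # 2.5

# Harmonic mean of two speeds covered over equal distances
harmonic = statistics.harmonic_mean([40, 60])  # 48.0
```

Note how the arithmetic mean (27.2) is pulled above the geometric mean (about 21.1) by the single high titre, which is why the geometric mean is preferred for this kind of data.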
High-yield: For income, hospital length-of-stay, serum bilirubin, incubation period and other right-skewed distributions, the median is the preferred measure of central tendency because the mean is dragged toward the long tail.
Empirical relationship in skewed data
For a moderately skewed unimodal distribution:
Mode = 3 × Median − 2 × Mean
This Karl Pearson relationship lets you compute the missing measure if two are known — a classic numerical.
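A small sketch of that numerical, rearranging Mode = 3 × Median − 2 × Mean for whichever measure is missing. The function names are hypothetical, and the relationship is only approximate for moderately skewed unimodal data.

```python
def mode_from(mean: float, median: float) -> float:
    """Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean

def median_from(mean: float, mode: float) -> float:
    """Rearranged: Median = (Mode + 2 * Mean) / 3."""
    return (mode + 2 * mean) / 3

def mean_from(median: float, mode: float) -> float:
    """Rearranged: Mean = (3 * Median - Mode) / 2."""
    return (3 * median - mode) / 2

# Worked example: mean 30, median 28 gives mode 3*28 - 2*30 = 24
print(mode_from(mean=30, median=28))
```

In this example Mean (30) > Median (28) > Mode (24), so the data are positively skewed.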
Measures of dispersion (spread)
Central tendency alone is incomplete; two datasets can share a mean yet differ wildly. Dispersion quantifies scatter.
| Measure | Formula / basis | Comment |
|---|---|---|
| Range | Max − Min | Crudest; uses only 2 values; very outlier-sensitive |
| Interquartile range (IQR) | Q3 − Q1 (middle 50%) | Robust; pairs with median; used in box plots |
| Variance (s²) | Mean of squared deviations from the mean (sample: divide by n − 1) | Units are squared |
| Standard deviation (SD) | √variance | Most used; same units as data; pairs with mean |
| Coefficient of variation (CV) | (SD ÷ Mean) × 100 | Unitless; compares variability across different units/scales |
Key points:
- Quartiles divide data into 4 parts (Q1 = 25th, Q2 = median = 50th, Q3 = 75th percentile). Deciles = 10 parts, percentiles = 100 parts.
- Sample SD divides by (n − 1) — the degrees of freedom / Bessel's correction; population SD divides by N. Dividing by (n − 1) corrects the underestimation of variance in a sample.
- CV is the tool of choice to compare the variability of two different variables (e.g., is weight more variable than height?) or the same variable in different units.
High-yield: Standard deviation (SD) describes the spread of individual observations in one sample, whereas standard error of the mean (SEM = SD/√n) describes the precision of the sample mean as an estimate of the population mean. SEM is smaller than SD whenever n > 1 and decreases as n increases. SEM is inferential, not descriptive, but it is a favourite distractor.
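The dispersion measures above can be computed as follows. This is a sketch on made-up weights; note that `statistics.quantiles` uses one particular quartile convention (its default "exclusive" method), and other textbooks or software may give slightly different Q1 and Q3.

```python
import math
import statistics

weights_kg = [50, 55, 60, 65, 70]  # hypothetical sample
n = len(weights_kg)

mean = statistics.mean(weights_kg)                 # 60
data_range = max(weights_kg) - min(weights_kg)     # 20
sample_variance = statistics.variance(weights_kg)  # divides by n - 1: 62.5
sample_sd = statistics.stdev(weights_kg)           # sqrt(62.5), about 7.91
population_sd = statistics.pstdev(weights_kg)      # divides by N: sqrt(50), about 7.07

q1, q2, q3 = statistics.quantiles(weights_kg, n=4)  # 52.5, 60.0, 67.5
iqr = q3 - q1                                       # 15.0

cv_percent = sample_sd / mean * 100  # unitless, about 13.2%
sem = sample_sd / math.sqrt(n)       # precision of the mean, about 3.54
```

The sample SD (n − 1) is larger than the population SD (N) for the same data, and the SEM is smaller than either.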
The normal (Gaussian) distribution & the empirical rule
For a perfectly normal (bell-shaped, symmetrical) curve, Mean = Median = Mode, and:
- Mean ± 1 SD covers ≈ 68% of observations
- Mean ± 2 SD covers ≈ 95% (more precisely, ± 1.96 SD covers 95%; ± 2 SD covers about 95.4%)
- Mean ± 3 SD covers ≈ 99.7%
High-yield: The 95% reference range / normal range of a clinical parameter = Mean ± 1.96 SD (often rounded to 2 SD). This defines the cut-off between "normal" and "abnormal" lab values.
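The coverage figures can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper names and the haemoglobin figures below are illustrative assumptions, not reference values.

```python
from statistics import NormalDist

def coverage_within(k_sd: float) -> float:
    """Proportion of a normal distribution lying within mean ± k_sd standard deviations."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k_sd) - z.cdf(-k_sd)

def reference_range(mean: float, sd: float, z: float = 1.96):
    """Mean ± z * SD; z = 1.96 gives the central 95% of a normal distribution."""
    return mean - z * sd, mean + z * sd

for k in (1, 1.96, 2, 3):
    print(k, round(coverage_within(k), 4))

# Hypothetical parameter: mean 14, SD 1 (arbitrary units)
low, high = reference_range(mean=14, sd=1)
```

This approach only makes sense when the parameter is approximately normally distributed; for skewed parameters a percentile-based range is the usual alternative.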
Skewness — shape of the distribution
Skewness measures asymmetry. This is a near-guaranteed question.
Stepwise approach to identify skew: Locate the long tail → the skew is named after the direction of the tail → the mean is pulled furthest toward the tail → order the three measures accordingly.
| Distribution | Tail direction | Order of measures | Typical example |
|---|---|---|---|
| Positive (right) skew | Long tail to the right | Mode < Median < Mean | Income, hospital stay, incubation period, serum triglycerides |
| Symmetric (normal) | None | Mean = Median = Mode | Height, BP in healthy adults |
| Negative (left) skew | Long tail to the left | Mean < Median < Mode | Age at death in developed countries, gestational age at delivery |
High-yield: In positive skew, Mean > Median > Mode: read in alphabetical order (mean, median, mode), the values are decreasing. In negative skew the order reverses. Of the three measures, the mean always lies closest to the tail. This single fact answers most skewness MCQs.
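The rule that the mean is pulled toward the tail can be turned into a quick check. This is only a heuristic sketch comparing mean and median, not a formal skewness coefficient; the function name and both datasets are hypothetical.

```python
import statistics

def skew_direction(values) -> str:
    """Infer the direction of skew from the position of the mean relative to the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right) skew"
    if mean < median:
        return "negative (left) skew"
    return "approximately symmetric"

hospital_stay_days = [1, 2, 2, 3, 3, 4, 30]       # one long stay drags the mean up
age_at_death_years = [40, 70, 75, 78, 80, 82, 85]  # one early death drags the mean down

print(skew_direction(hospital_stay_days))
print(skew_direction(age_at_death_years))
```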
Kurtosis describes "peakedness": leptokurtic (tall, thin), mesokurtic (normal), platykurtic (flat). Less commonly asked.
Graphical presentation of data
The examiner's favourite: match the data type to the correct diagram.
Diagrams for qualitative (categorical) data
| Graph | Use | Key feature |
|---|---|---|
| Bar chart / bar diagram | Nominal or discrete categories | Bars are separated (gaps); height = frequency |
| Pie chart / sector diagram | Categorical data as proportions of a whole | Angle of each sector = (value/total) × 360° |
| Pictogram | Categorical, for lay audiences | Uses pictures/symbols |
| Map / spot map / cartogram | Geographical/spatial distribution | E.g., John Snow's cholera map |
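The pie chart formula, angle = (value/total) × 360°, worked through on made-up blood group counts. The function name `sector_angles` is hypothetical.

```python
def sector_angles(counts: dict) -> dict:
    """Angle in degrees of each pie sector: (value / total) * 360."""
    total = sum(counts.values())
    return {category: value * 360 / total for category, value in counts.items()}

# Hypothetical class of 50 students
blood_groups = {"A": 10, "B": 15, "AB": 5, "O": 20}
angles = sector_angles(blood_groups)
print(angles)  # {'A': 72.0, 'B': 108.0, 'AB': 36.0, 'O': 144.0}
```

The angles must add up to 360°, which is a useful check in numericals.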
Diagrams for quantitative (continuous) data
| Graph | Use | Key feature |
|---|---|---|
| Histogram | Frequency distribution of continuous data | Bars touch (no gaps); area ∝ frequency |
| Frequency polygon | Continuous data; compares ≥2 distributions | Line joining the midpoints of the tops of the histogram bars |
| Frequency curve | Smoothed frequency polygon | Large n, small class intervals |
| Ogive (cumulative frequency curve) | Reading median, quartiles, percentiles | S-shaped cumulative plot |
| Scatter diagram | Relationship between two quantitative variables | Shows correlation direction/strength |
| Stem-and-leaf plot | Small datasets; retains raw values | Acts as histogram + raw data |
| Box-and-whisker plot | Shows median, IQR, range, outliers | Box = IQR (Q1–Q3); line = median |
| Line / trend graph | Data over time (time-series) | E.g., trend of disease incidence by year |
High-yield — Histogram vs Bar chart (the most repeated graph MCQ):
- Histogram: continuous/quantitative data, bars touch, x-axis is a continuous scale, area represents frequency.
- Bar chart: qualitative/discrete data, bars are separated, x-axis is categories, height represents frequency.
High-yield: A pie chart is best for showing component parts of a single whole (proportions). To compare components across multiple groups, a component (stacked) bar chart is better.
High-yield: Scatter diagram is the graph of choice to display correlation between two continuous variables (e.g., age vs blood pressure). The box plot is the graph of choice to display median and spread / detect outliers and to compare skewness between groups.
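What a box plot draws can be sketched numerically. Two assumptions here are not stated in these notes: outliers are flagged with the commonly used 1.5 × IQR fences, and quartiles follow the default convention of Python's `statistics.quantiles`. The function name and data are hypothetical.

```python
import statistics

def box_plot_summary(values) -> dict:
    """Median, quartiles, IQR and outliers, using 1.5 * IQR fences (an assumed convention)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical hospital stays in days, with one very long admission
stays = [2, 3, 3, 4, 4, 5, 5, 6, 7, 30]
summary = box_plot_summary(stays)
print(summary)
```

Here the 30-day stay falls beyond the upper fence and would be drawn as a separate point, while the median and box are barely affected by it.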
Special epidemiological graphs
- Epidemic curve: histogram of cases plotted against time of onset. A single sharp peak suggests a point-source (common-source) outbreak; successive peaks one incubation period apart suggest propagated (person-to-person) spread.
- Arithmetic vs semi-logarithmic scale: a semilog (ratio) graph is preferred when showing rate of change / proportional change or when values span several orders of magnitude; equal ratios appear as equal distances. Arithmetic scale shows absolute change.
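A minimal numerical illustration of why equal ratios appear as equal distances on a semilog scale. The case counts and the function name are hypothetical.

```python
import math

def successive_steps(values, log_scale: bool = False):
    """Vertical distance between consecutive points on the chosen y-axis scale."""
    ys = [math.log10(v) for v in values] if log_scale else list(values)
    return [b - a for a, b in zip(ys, ys[1:])]

cases = [10, 100, 1000]  # each value is ten times the previous one

arithmetic = successive_steps(cases)               # absolute change: [90, 900]
semilog = successive_steps(cases, log_scale=True)  # two equal steps of about 1.0

print(arithmetic, semilog)
```

On the arithmetic scale the second rise looks ten times larger; on the log scale both tenfold rises are the same height.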
Choosing the right summary — flow
- Step 1: Identify the data type (nominal / ordinal / interval / ratio).
- Step 2: If categorical → mode + bar/pie chart.
- Step 3: If numerical → check the distribution shape (plot a histogram or box plot).
- Step 4a: If symmetric → report Mean ± SD.
- Step 4b: If skewed or has outliers → report Median (IQR).
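The flow above can be written as a small decision function. This is a sketch only: the function name is hypothetical, and the ordinal branch follows the NOIR table earlier in these notes (median for ordinal data) rather than the shortened flow.

```python
def recommend_summary(scale: str, skewed: bool = False) -> str:
    """Suggest a summary measure from the scale of measurement and distribution shape."""
    scale = scale.lower()
    if scale == "nominal":
        return "mode; bar or pie chart"
    if scale == "ordinal":
        return "median; bar chart"
    if scale in ("interval", "ratio"):
        # Numerical data: the shape of the distribution decides
        return "median (IQR)" if skewed else "mean ± SD"
    raise ValueError(f"unknown scale of measurement: {scale}")

print(recommend_summary("nominal"))             # e.g. blood group
print(recommend_summary("ratio", skewed=True))  # e.g. income
print(recommend_summary("ratio"))               # e.g. height
```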
Common errors & differentials (concept traps)
- Mean vs Median choice: Using the mean for skewed data (e.g., income) is the classic error — always prefer median there.
- SD vs SEM: SD = data spread; SEM = precision of the mean. Reporting SEM to imply low variability is misleading.
- Histogram vs Bar chart: decided by whether data are continuous (touching bars) or categorical (gaps).
- Interval vs Ratio: decided by presence of a true zero; governs whether ratios/multiplication are meaningful.
- Incidence vs prevalence and rate vs ratio vs proportion: these are confused by definition. A proportion is a ratio in which the numerator is part of the denominator; a rate additionally includes a time element.
- Correlation ≠ causation: scatter/correlation shows association only.
Complications of poor data presentation
Poor presentation can mislead even when the underlying data are sound. The main pitfalls are:
- Misleading axes (truncated y-axis exaggerating differences).
- Inappropriate chart (pie chart for time-series, 3-D charts distorting area).
- Ecological fallacy when group-level data are applied to individuals.
- Overlapping/unequal class intervals distorting a histogram.
Recently asked / exam angle
- "Which graph for presenting blood groups in a class?" → Bar chart (or pie chart for proportions); blood group is nominal.
- "Best measure of central tendency for serum bilirubin (right-skewed)?" → Median.
- "Temperature in °C is which type of data?" → Interval (no true zero).
- "In a positively skewed distribution, the relation is?" → Mean > Median > Mode.
- "Mean ± 2 SD covers what % of normal distribution?" → ~95% (1.96 SD).
- "Graph used to detect an outlier / compare medians?" → Box-and-whisker plot.
- "Which is unaffected by extreme values — mean, median, mode, range?" → Median (and mode); mean & range are affected.
- "Coefficient of variation is used to?" → Compare variability between two different units/series.
- "Histogram is used for ___ data?" → Continuous (quantitative).
- "Standard error of mean depends on?" → SD and sample size (1/√n).
- "Epidemic curve with a single peak indicates?" → Point/common-source outbreak.
- "Divides data into 100 equal parts?" → Percentiles.
Rapid revision
- NOIR scale: Nominal → Ordinal → Interval → Ratio (rising information content).
- True zero distinguishes Ratio (weight, BP, age) from Interval (°C, IQ, calendar year); Kelvin = ratio.
- Mode → nominal, Median → ordinal/skewed, Mean → symmetric ratio data.
- Mean and range are sensitive to outliers; median and mode are robust.
- Positive (right) skew: Mean > Median > Mode; tail to the right (e.g., income, hospital stay).
- Negative (left) skew: Mean < Median < Mode; tail to the left.
- Normal curve: Mean = Median = Mode; 68 / 95 / 99.7% within 1 / 2 / 3 SD.
- Normal reference range = Mean ± 1.96 SD (~95%).
- SD = spread of observations; SEM = SD/√n = precision of mean (decreases with larger n).
- CV = (SD/Mean) × 100 — unitless, compares variability across different scales.
- Histogram (continuous, bars touch) vs Bar chart (categorical, bars separated); Pie chart for proportions of a whole.
- Scatter plot → correlation between two continuous variables; Box plot → median, IQR, outliers; Ogive → percentiles/median.