Descriptive Statistics & Data Presentation

Community Medicine · Biostatistics · lean revision notes

Descriptive statistics summarise and organise raw data so that patterns become visible without making inferences beyond the sample. For NEET PG, this is a reliable scoring area in Community Medicine (Biostatistics): expect direct questions on type of data → correct graph, which measure of central tendency to use, and interpretation of skewness and standard deviation.

Definition & scope

Biostatistics is the application of statistical methods to biological and medical data. It has two broad arms:

  1. Descriptive statistics — describe and condense data (tables, graphs, mean, SD, etc.). No probability/inference involved.
  2. Inferential statistics — draw conclusions about a population from a sample (hypothesis tests, confidence intervals, p-values).

This topic deals exclusively with the descriptive arm. The workflow is: collect data → classify the variable type → present it (table/graph) → summarise centre and spread → comment on shape (distribution).

High-yield: The single most-tested concept is matching the type of data to the correct graphical method and correct measure of central tendency. Memorise the master table below.

Types of data (variables)

Variables are first split into qualitative (categorical) and quantitative (numerical), and then further classified using the NOIR scale of measurement (Stevens' classification).

Scale | Type | Nature | Example | Centre measure | Math allowed
Nominal | Qualitative | Named categories, no order | Blood group, sex, religion | Mode | Counting only
Ordinal | Qualitative | Ordered categories, unequal/unknown gaps | Cancer stage, pain score, socio-economic class | Median | Ranking
Interval | Quantitative | Ordered, equal intervals, no true zero | Temperature in °C, IQ, calendar dates | Mean/Median | +, −
Ratio | Quantitative | Equal intervals, true zero present | Weight, height, pulse, BP, age, income | Mean | +, −, ×, ÷

Mnemonic — "NOIR" (French for black): Nominal → Ordinal → Interval → Ratio, in increasing order of information content.

Key distinctions tested:

  • Nominal vs Ordinal: both are categorical; ordinal has an inherent order (mild < moderate < severe), nominal does not (A vs B vs O blood group).
  • Interval vs Ratio: the deciding test is a true/absolute zero. 0 °C does not mean "no temperature" → interval. 0 kg means "no weight" → ratio. Because interval data lack a true zero, you cannot say 40 °C is twice as hot as 20 °C, but you can say 40 kg is twice 20 kg.

Quantitative data are also split as:

  • Discrete — whole-number counts (number of children, number of episodes of diarrhoea). Cannot take fractions.
  • Continuous — any value within a range (height, weight, serum sodium). Can take decimals.

High-yield: Temperature in Celsius/Fahrenheit = interval; Temperature in Kelvin = ratio (Kelvin has an absolute zero). This exact trick is frequently asked.
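The true-zero test can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `celsius_to_kelvin` and the two temperatures are illustrative choices, not part of the notes.

```python
# Ratios are only meaningful on a scale with a true zero.
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

c_low, c_high = 20.0, 40.0

# Dividing Celsius values gives 2.0, but this "twice as hot" is not meaningful
naive_ratio = c_high / c_low

# The same two temperatures on the Kelvin scale differ by only about 7%
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(naive_ratio, round(kelvin_ratio, 3))
```

The Celsius ratio changes if the zero point is moved (for example to Fahrenheit), whereas the Kelvin ratio does not, which is why only ratio-scale data support × and ÷.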

Frequency distribution & tabulation

Raw data are first organised into a frequency distribution table.

  • Class interval (CI): the range of each group (e.g., 10–19, 20–29).
  • Class limits vs class boundaries: boundaries remove gaps (9.5–19.5) for continuous plotting.
  • Cumulative frequency: running total — basis for the ogive and for computing median, quartiles, percentiles.
  • Sturges' rule for number of classes: k = 1 + 3.322 log₁₀(n).
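As a rough illustration of the points above, the sketch below applies Sturges' rule and builds a frequency table with cumulative frequencies. The ages are hypothetical, the function names are illustrative, and rounding k up to a whole number is one common convention rather than something the rule itself fixes.

```python
import math
from collections import Counter

def sturges_classes(n: int) -> int:
    """Number of classes by Sturges' rule, rounded up (a convention, assumed here)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def frequency_table(values, start, width, k):
    """Rows of (lower, upper, frequency, cumulative frequency).

    Each class is lower <= value < upper, so the classes have no gaps.
    """
    counts = Counter((v - start) // width for v in values)
    rows, running = [], 0
    for i in range(k):
        freq = counts.get(i, 0)
        running += freq  # running total feeds the ogive / median / percentiles
        rows.append((start + i * width, start + (i + 1) * width, freq, running))
    return rows

# Hypothetical ages of 10 patients
ages = [12, 15, 21, 24, 27, 33, 35, 38, 41, 49]
k = sturges_classes(len(ages))
rows = frequency_table(ages, start=10, width=10, k=k)
for row in rows:
    print(row)
```

The last cumulative frequency always equals n, which is a quick check that no observation was dropped during tabulation.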

Measures of central tendency

These describe the "centre" or typical value.

Measure | Definition | Best for | Affected by outliers?
Mean | Sum ÷ n (arithmetic average) | Symmetric/normal, ratio/interval data | Yes (very sensitive)
Median | Middle value when ordered (50th percentile) | Skewed data, ordinal data | No (robust)
Mode | Most frequent value | Nominal data | No

Calculation reminders:

  • Median position = (n + 1)/2 th observation. For even n, average the two middle values.
  • A dataset can have no mode, one mode (unimodal), two (bimodal) or more.
  • Geometric mean (nth root of the product of n values) is used for multiplicative data: ratios, growth rates, titres (e.g., antibody titres after vaccination, serial dilutions) and highly right-skewed data. Harmonic mean is used for averaging rates such as speeds.
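The calculation reminders above can be tried with Python's standard `statistics` module. This is a sketch on hypothetical values (the titres and speeds are made up for illustration).

```python
import statistics as st

# Hypothetical reciprocal antibody titres (illustrative values only)
titres = [8, 16, 16, 32, 256]

arithmetic_mean = st.mean(titres)            # dragged upward by the single 256
median = st.median(titres)                   # (n + 1)/2 = 3rd ordered value
mode = st.mode(titres)                       # most frequent value
geometric_mean = st.geometric_mean(titres)   # nth root of the product

# Even n: the median is the average of the two middle values
median_even = st.median([1, 2, 3, 4])

# Harmonic mean: average speed over two equal distances at 40 and 60 km/h
harmonic_mean = st.harmonic_mean([40, 60])

print(arithmetic_mean, median, mode, round(geometric_mean, 2))
print(median_even, round(harmonic_mean, 2))
```

For these titres the geometric mean sits far closer to the median than the arithmetic mean does, which is the reason it is preferred for titre data.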

High-yield: For income, hospital length-of-stay, serum bilirubin, incubation period and other right-skewed distributions, the median is the preferred measure of central tendency because the mean is dragged toward the long tail.

Empirical relationship in skewed data

For a moderately skewed unimodal distribution:

Mode = 3 × Median − 2 × Mean

This Karl Pearson relationship lets you compute the missing measure if two are known — a classic numerical.
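A minimal sketch of that numerical: the relationship is rearranged to solve for whichever measure is missing. The function names and the example values (mean 30, median 28) are illustrative, and the result is only an approximation that holds for moderately skewed unimodal data.

```python
def pearson_mode(mean: float, median: float) -> float:
    """Mode = 3 x Median - 2 x Mean."""
    return 3 * median - 2 * mean

def pearson_median(mean: float, mode: float) -> float:
    """Rearranged: Median = (Mode + 2 x Mean) / 3."""
    return (mode + 2 * mean) / 3

def pearson_mean(median: float, mode: float) -> float:
    """Rearranged: Mean = (3 x Median - Mode) / 2."""
    return (3 * median - mode) / 2

# Hypothetical question: mean = 30, median = 28, find the mode
mode = pearson_mode(mean=30, median=28)
print(mode)
```

Here the estimated mode is below the median, which is below the mean, so the example describes a positively skewed distribution.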

Measures of dispersion (spread)

Central tendency alone is incomplete; two datasets can share a mean yet differ wildly. Dispersion quantifies scatter.

Measure | Formula / basis | Comment
Range | Max − Min | Crudest; uses only 2 values; very outlier-sensitive
Interquartile range (IQR) | Q3 − Q1 (middle 50%) | Robust; pairs with median; used in box plots
Variance (s²) | Mean of squared deviations | Units are squared
Standard deviation (SD) | √variance | Most used; same units as data; pairs with mean
Coefficient of variation (CV) | (SD ÷ Mean) × 100 | Unitless; compares variability across different units/scales

Key points:

  • Quartiles divide data into 4 parts (Q1 = 25th, Q2 = median = 50th, Q3 = 75th percentile). Deciles = 10 parts, percentiles = 100 parts.
  • Sample SD divides by (n − 1) — the degrees of freedom / Bessel's correction; population SD divides by N. Dividing by (n − 1) corrects the underestimation of variance in a sample.
  • CV is the tool of choice to compare the variability of two different variables (e.g., is weight more variable than height?) or the same variable in different units.
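The measures above can be computed with the standard `statistics` module, as in this sketch on a hypothetical sample of weights. Quartile conventions differ between textbooks and software; the values here use the module's default ("exclusive") method, which is an assumption of this example.

```python
import statistics as st

# Hypothetical sample of 9 body weights in kg
weights_kg = [52, 55, 58, 60, 61, 63, 66, 70, 75]

data_range = max(weights_kg) - min(weights_kg)

# Quartiles: Q1, Q2 (median), Q3 using the default "exclusive" method
q1, q2, q3 = st.quantiles(weights_kg, n=4)
iqr = q3 - q1

sample_variance = st.variance(weights_kg)   # divides by (n - 1)
sample_sd = st.stdev(weights_kg)            # square root of the sample variance
population_sd = st.pstdev(weights_kg)       # divides by N

# Coefficient of variation: unitless, expressed as a percentage
cv_percent = sample_sd / st.mean(weights_kg) * 100

print(data_range, q1, q2, q3, iqr)
print(round(sample_sd, 2), round(population_sd, 2), round(cv_percent, 1))
```

The sample SD comes out slightly larger than the population SD on the same numbers, which is the effect of dividing by (n − 1) instead of N.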

High-yield: Standard deviation (SD) describes the spread of individual observations in one sample, whereas standard error of the mean (SEM = SD/√n) describes the precision of the sample mean as an estimate of the population mean. SEM is smaller than SD whenever n > 1 and decreases as n increases. SEM is inferential, not descriptive, but it is a favourite distractor.
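A short sketch of SEM = SD/√n, showing that the same SD gives a smaller SEM as the sample grows. The function name `sem_from_sd` and the numbers are illustrative.

```python
import math

def sem_from_sd(sd: float, n: int) -> float:
    """Standard error of the mean from a sample SD and sample size."""
    return sd / math.sqrt(n)

# Same spread of observations (SD = 10), two different sample sizes
sem_small_sample = sem_from_sd(sd=10, n=25)
sem_large_sample = sem_from_sd(sd=10, n=100)

print(sem_small_sample, sem_large_sample)
```

Quadrupling n only halves the SEM, because the sample size enters through its square root.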

The normal (Gaussian) distribution & the empirical rule

For a perfectly normal (bell-shaped, symmetrical) curve, Mean = Median = Mode, and:

  • Mean ± 1 SD covers ≈ 68% of observations
  • Mean ± 2 SD covers ≈ 95% (exactly 95% lies within ± 1.96 SD; ± 2 SD covers about 95.4%)
  • Mean ± 3 SD covers ≈ 99.7%

High-yield: The 95% reference range / normal range of a clinical parameter = Mean ± 1.96 SD (often rounded to 2 SD). This defines the cut-off between "normal" and "abnormal" lab values.
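The coverage figures and the reference range can be checked numerically with `statistics.NormalDist`. This is a sketch that assumes the parameter really is normally distributed; the haemoglobin mean and SD are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

def coverage(k: float) -> float:
    """Proportion of a normal distribution lying within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def reference_range(mean: float, sd: float, z: float = 1.96):
    """95% reference range as (lower, upper) = mean -/+ z * SD."""
    return (mean - z * sd, mean + z * sd)

for k in (1, 1.96, 2, 3):
    print(k, round(coverage(k), 4))

# Hypothetical haemoglobin: mean 14 g/dL, SD 1 g/dL
lower, upper = reference_range(mean=14, sd=1)
print(round(lower, 2), round(upper, 2))
```

By construction about 5% of healthy individuals fall outside such a range, 2.5% in each tail.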

Skewness — shape of the distribution

Skewness measures asymmetry. This is a near-guaranteed question.

Stepwise approach to identify skew: locate the tail → the tail points toward the skew direction → the mean is pulled toward the tail → order the three measures accordingly.

Distribution | Tail direction | Order of measures | Typical example
Positive (right) skew | Long tail to the right | Mode < Median < Mean | Income, hospital stay, incubation period, serum triglycerides
Symmetric (normal) | None | Mean = Median = Mode | Height, BP in healthy adults
Negative (left) skew | Long tail to the left | Mean < Median < Mode | Age at death in developed countries, gestational age at delivery

High-yield: In positive skew, Mean > Median > Mode. Memory aid: written in alphabetical order (mean, median, mode), the values run in decreasing order for positive skew and in increasing order for negative skew. The mean is always the measure closest to the tail. This single fact answers most skewness MCQs.
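The ordering can be seen on a small right-skewed dataset. This is a sketch with hypothetical length-of-stay values; comparing mean and median is a rule of thumb for the direction of skew, not a formal skewness coefficient, and `skew_direction` is an illustrative helper.

```python
import statistics as st

# Hypothetical hospital length of stay in days: one long-stay patient forms the tail
stay_days = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean = st.mean(stay_days)
median = st.median(stay_days)
mode = st.mode(stay_days)

def skew_direction(mean: float, median: float) -> str:
    """Rule of thumb: the mean is pulled toward the tail."""
    if mean > median:
        return "positive (right)"
    if mean < median:
        return "negative (left)"
    return "symmetric"

print(mode, median, mean, skew_direction(mean, median))
```

The single value of 21 days pulls the mean above the median, while the mode stays at the most common short stay.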

Kurtosis describes "peakedness": leptokurtic (tall, thin), mesokurtic (normal), platykurtic (flat). Less commonly asked.

Graphical presentation of data

The examiner's favourite: match the data type to the correct diagram.

Diagrams for qualitative (categorical) data

Graph | Use | Key feature
Bar chart / bar diagram | Nominal or discrete categories | Bars are separated (gaps); height = frequency
Pie chart / sector diagram | Categorical data as proportions of a whole | Angle of each sector = (value/total) × 360°
Pictogram | Categorical, for lay audiences | Uses pictures/symbols
Map / spot map / cartogram | Geographical/spatial distribution | E.g., John Snow's cholera map

Diagrams for quantitative (continuous) data

Graph | Use | Key feature
Histogram | Frequency distribution of continuous data | Bars touch (no gaps); area ∝ frequency
Frequency polygon | Continuous data; compares ≥2 distributions | Line joining midpoints of class tops
Frequency curve | Smoothed frequency polygon | Large n, small class intervals
Ogive (cumulative frequency curve) | Reading median, quartiles, percentiles | S-shaped cumulative plot
Scatter diagram | Relationship between two quantitative variables | Shows correlation direction/strength
Stem-and-leaf plot | Small datasets; retains raw values | Acts as histogram + raw data
Box-and-whisker plot | Shows median, IQR, range, outliers | Box = IQR (Q1–Q3); line = median
Line / trend graph | Data over time (time-series) | E.g., epidemic curve trend

High-yield — Histogram vs Bar chart (the most repeated graph MCQ):

  • Histogram: continuous/quantitative data, bars touch, x-axis is a continuous scale, area represents frequency.
  • Bar chart: qualitative/discrete data, bars are separated, x-axis is categories, height represents frequency.

High-yield: A pie chart is best for showing component parts of a single whole (proportions). To compare components across multiple groups, a component (stacked) bar chart is better.
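The sector-angle rule for a pie chart, angle = (value/total) × 360°, can be sketched as follows. The blood-group counts are hypothetical and `sector_angles` is an illustrative name.

```python
def sector_angles(counts: dict) -> dict:
    """Angle in degrees of each pie sector: (value / total) x 360."""
    total = sum(counts.values())
    return {category: value * 360 / total for category, value in counts.items()}

# Hypothetical blood groups in a class of 100 students
blood_groups = {"A": 25, "B": 35, "AB": 10, "O": 30}
angles = sector_angles(blood_groups)

for group, angle in angles.items():
    print(group, angle)
```

The angles always add up to 360°, which is why a pie chart suits parts of a single whole and not comparisons across several groups.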

High-yield: Scatter diagram is the graph of choice to display correlation between two continuous variables (e.g., age vs blood pressure). The box plot is the graph of choice to display median and spread / detect outliers and to compare skewness between groups.
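What a box plot draws can be sketched as a small summary function. Two assumptions are made here that the notes do not state: outliers are flagged with the common 1.5 × IQR fence (Tukey's convention), and quartiles use the default method of `statistics.quantiles`. The data and the name `box_plot_summary` are hypothetical.

```python
import statistics as st

def box_plot_summary(values):
    """Median, quartiles, IQR and outliers by the 1.5 x IQR fence (assumed convention)."""
    data = sorted(values)
    q1, median, q3 = st.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr, "outliers": outliers,
    }

# Hypothetical hospital length of stay in days
stay_days = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]
summary = box_plot_summary(stay_days)
print(summary)
```

The median line sitting near the bottom of the box, with a long upper whisker or a high outlier, is how a box plot displays positive skew.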

Special epidemiological graphs

  • Epidemic curve: histogram of cases plotted against time of onset. A single sharp peak suggests a point-source (common-source) outbreak; successive peaks one incubation period apart suggest propagated (person-to-person) spread.
  • Arithmetic vs semi-logarithmic scale: a semilog (ratio) graph is preferred when showing rate of change / proportional change or when values span several orders of magnitude; equal ratios appear as equal distances. Arithmetic scale shows absolute change.
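The statement that equal ratios appear as equal distances on a semilog scale can be verified directly. The case counts below are hypothetical and rise tenfold at each step.

```python
import math

# Hypothetical case counts, each 10 times the previous one
cases = [10, 100, 1000, 10000]

# Arithmetic scale: the gaps between successive points are the absolute changes
arithmetic_gaps = [b - a for a, b in zip(cases, cases[1:])]

# Log scale: the gaps are differences of log10 values, i.e. the log of the ratio
log_gaps = [math.log10(b) - math.log10(a) for a, b in zip(cases, cases[1:])]

print(arithmetic_gaps)
print([round(g, 6) for g in log_gaps])
```

On the arithmetic scale the early points are crushed against the axis, while on the log scale a constant growth rate plots as a straight line.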

Choosing the right summary — flow

  Step 1: Identify data type (nominal / ordinal / interval / ratio).
  Step 2: If categorical → mode + bar/pie chart.
  Step 3: If numerical → check the distribution shape (plot histogram/box plot).
  Step 4a: If symmetric → report Mean ± SD.
  Step 4b: If skewed/has outliers → report Median (IQR).
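The flow can be written as a small decision function. This is a sketch only: `choose_summary` and its return strings are illustrative, the `skewed` flag stands in for the visual check of the histogram or box plot, and the ordinal branch follows the NOIR table (median) rather than the mode.

```python
def choose_summary(scale: str, skewed: bool = False) -> str:
    """Suggest a summary measure and graph for a given scale of measurement."""
    scale = scale.lower()
    if scale == "nominal":
        return "mode; bar or pie chart"
    if scale == "ordinal":
        return "median; bar chart"
    if scale in ("interval", "ratio"):
        # Numerical data: the shape of the distribution decides the summary
        if skewed:
            return "median (IQR); box plot"
        return "mean ± SD; histogram"
    raise ValueError(f"unknown scale: {scale!r}")

print(choose_summary("nominal"))
print(choose_summary("ratio", skewed=True))
print(choose_summary("ratio", skewed=False))
```

For example, blood group goes down the nominal branch, while hospital length of stay is ratio data with a right tail and so ends at median (IQR).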

Common errors & differentials (concept traps)

  • Mean vs Median choice: Using the mean for skewed data (e.g., income) is the classic error — always prefer median there.
  • SD vs SEM: SD = data spread; SEM = precision of the mean. Reporting SEM to imply low variability is misleading.
  • Histogram vs Bar chart: decided by whether data are continuous (touching bars) or categorical (gaps).
  • Interval vs Ratio: decided by presence of a true zero; governs whether ratios/multiplication are meaningful.
  • Incidence vs prevalence, rate vs ratio vs proportion — definitional confusion (a proportion is a ratio where numerator is part of the denominator; a rate includes a time element).
  • Correlation ≠ causation: scatter/correlation shows association only.

Complications of poor data presentation

While "complications" is a clinical word, in biostatistics the analogous pitfalls are:

  • Misleading axes (truncated y-axis exaggerating differences).
  • Inappropriate chart (pie chart for time-series, 3-D charts distorting area).
  • Ecological fallacy when group-level data are applied to individuals.
  • Overlapping/unequal class intervals distorting a histogram.

Recently asked / exam angle

  • "Which graph for presenting blood groups in a class?" → Bar chart (or pie chart for proportions); blood group is nominal.
  • "Best measure of central tendency for serum bilirubin (right-skewed)?" → Median.
  • "Temperature in °C is which type of data?" → Interval (no true zero).
  • "In a positively skewed distribution, the relation is?" → Mean > Median > Mode.
  • "Mean ± 2 SD covers what % of normal distribution?" → ~95% (1.96 SD).
  • "Graph used to detect an outlier / compare medians?" → Box-and-whisker plot.
  • "Which is unaffected by extreme values: mean, median, mode, range?" → Median (and mode); mean & range are affected.
  • "Coefficient of variation is used to?" → Compare variability between two different units/series.
  • "Histogram is used for ___ data?" → Continuous (quantitative).
  • "Standard error of mean depends on?" → SD and sample size (1/√n).
  • "Epidemic curve with a single peak indicates?" → Point/common-source outbreak.
  • "Divides data into 100 equal parts?" → Percentiles.

Rapid revision

  1. NOIR scale: Nominal → Ordinal → Interval → Ratio (rising information content).
  2. True zero distinguishes Ratio (weight, BP, age) from Interval (°C, IQ, calendar year); Kelvin = ratio.
  3. Mode → nominal, Median → ordinal/skewed, Mean → symmetric ratio data.
  4. Mean is the most affected by outliers; median and mode are robust.
  5. Positive (right) skew: Mean > Median > Mode; tail to the right (e.g., income, hospital stay).
  6. Negative (left) skew: Mean < Median < Mode; tail to the left.
  7. Normal curve: Mean = Median = Mode; 68 / 95 / 99.7% within 1 / 2 / 3 SD.
  8. Normal reference range = Mean ± 1.96 SD (~95%).
  9. SD = spread of observations; SEM = SD/√n = precision of mean (decreases with larger n).
  10. CV = (SD/Mean) × 100 — unitless, compares variability across different scales.
  11. Histogram (continuous, bars touch) vs Bar chart (categorical, bars separated); Pie chart for proportions of a whole.
  12. Scatter plot → correlation between two continuous variables; Box plot → median, IQR, outliers; Ogive → percentiles/median.