Descriptive Statistics & Data Presentation

Community Medicine · Biostatistics · lean revision notes

Descriptive statistics summarise and organise raw data so that patterns become visible without making inferences beyond the sample. For NEET PG, this is a reliable scoring area in Community Medicine (Biostatistics): expect direct questions on type of data → correct graph, which measure of central tendency to use, and interpretation of skewness and standard deviation.

Definition & scope

Biostatistics is the application of statistical methods to biological and medical data. It has two broad arms:

Descriptive statistics — describe and condense data (tables, graphs, mean, SD, etc.). No probability/inference involved.
Inferential statistics — draw conclusions about a population from a sample (hypothesis tests, confidence intervals, p-values).

This topic deals exclusively with the descriptive arm. The workflow is: collect data → classify the variable type → present it (table/graph) → summarise centre and spread → comment on shape (distribution).

High-yield: The single most-tested concept is matching the type of data to the correct graphical method and correct measure of central tendency. Memorise the master table below.

Types of data (variables)

Variables are first split into qualitative (categorical) and quantitative (numerical), and then further classified using the NOIR scale of measurement (Stevens' classification).

Scale	Type	Nature	Example	Centre measure	Math allowed
Nominal	Qualitative	Named categories, no order	Blood group, sex, religion	Mode	Counting only
Ordinal	Qualitative	Ordered categories, unequal/unknown gaps	Cancer stage, pain score, socio-economic class	Median	Ranking
Interval	Quantitative	Ordered, equal intervals, no true zero	Temperature in °C, IQ, calendar dates	Mean/Median	+, −
Ratio	Quantitative	Equal intervals, true zero present	Weight, height, pulse, BP, age, income	Mean	+, −, ×, ÷

Mnemonic — "NOIR" (French for black): Nominal → Ordinal → Interval → Ratio, in increasing order of information content.

Key distinctions tested:

Nominal vs Ordinal: both are categorical; ordinal has an inherent order (mild < moderate < severe), nominal does not (A vs B vs O blood group).
Interval vs Ratio: the deciding test is a true/absolute zero. 0 °C does not mean "no temperature" → interval. 0 kg means "no weight" → ratio. Because interval data lack a true zero, you cannot say 40 °C is twice as hot as 20 °C, but you can say 40 kg is twice 20 kg.

Quantitative data are also split as:

Discrete — whole-number counts (number of children, number of episodes of diarrhoea). Cannot take fractions.
Continuous — any value within a range (height, weight, serum sodium). Can take decimals.

High-yield: Temperature in Celsius/Fahrenheit = interval; Temperature in Kelvin = ratio (Kelvin has an absolute zero). This exact trick is frequently asked.
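The interval-versus-ratio point can be checked numerically. The sketch below is illustrative only; the helper name `celsius_to_kelvin` is an assumption introduced here. It shows that the "twice as hot" ratio disappears once the same temperatures are expressed on a scale with a true zero.

```python
# Ratios are meaningful only on a scale with a true (absolute) zero.

def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

# Dividing Celsius readings gives 2.0, but this number has no physical meaning.
ratio_celsius = 40 / 20

# The same two temperatures on the Kelvin scale differ by a factor of about 1.07.
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(f"Naive Celsius ratio: {ratio_celsius:.2f}")
print(f"Kelvin ratio:        {ratio_kelvin:.2f}")
```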

Frequency distribution & tabulation

Raw data are first organised into a frequency distribution table.

Class interval (CI): the range of each group (e.g., 10–19, 20–29).
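Class limits vs class boundaries: limits are the stated end values of each class (10–19, 20–29); boundaries close the gaps between adjacent classes (9.5–19.5, 19.5–29.5) so that continuous data can be plotted.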
Cumulative frequency: running total — basis for the ogive and for computing median, quartiles, percentiles.
Sturges' rule for number of classes: k = 1 + 3.322 log₁₀(n).
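The tabulation steps above can be sketched in a few lines. The ages are made-up values, the ten-year classes are chosen by hand for readability, and rounding Sturges' result up to a whole number is an assumption (the rule is only a guide).

```python
import math
from itertools import accumulate

def sturges_classes(n: int) -> int:
    """Suggested number of classes, k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

# Hypothetical ages (years) of 20 patients; not real data.
ages = [12, 15, 18, 21, 22, 24, 25, 27, 29, 30,
        31, 33, 34, 36, 38, 41, 43, 45, 47, 49]

# Ten-year class limits chosen by hand (fewer classes than Sturges suggests).
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]

# Frequency = number of observations falling within each pair of limits.
frequencies = [sum(low <= age <= high for age in ages)
               for low, high in class_limits]

# Cumulative frequency = running total, the basis of the ogive.
cumulative = list(accumulate(frequencies))

print("Sturges' rule suggests", sturges_classes(len(ages)), "classes")
for (low, high), f, cf in zip(class_limits, frequencies, cumulative):
    # Class boundaries extend each limit by 0.5 so adjacent classes touch.
    print(f"{low}-{high}  boundaries {low - 0.5}-{high + 0.5}  f={f}  cum f={cf}")
```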

Measures of central tendency

These describe the "centre" or typical value.

Measure	Definition	Best for	Affected by outliers?
Mean	Sum ÷ n (arithmetic average)	Symmetric/normal, ratio/interval data	Yes (very sensitive)
Median	Middle value when ordered (50th percentile)	Skewed data, ordinal data	No (robust)
Mode	Most frequent value	Nominal data	No

Calculation reminders:

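Median position = the (n + 1)/2-th observation of the ordered data. For even n this position falls between two observations, so average the two middle values (the n/2-th and the (n/2 + 1)-th).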
A dataset can have no mode, one mode (unimodal), two (bimodal) or more.
Geometric mean (nth root of product) is used for rates, ratios, titres (e.g., antibody titres after vaccination, serial dilutions) and for highly right-skewed data. Harmonic mean is used for averaging rates/speeds.
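A short sketch using Python's standard `statistics` module shows how these measures behave. All numbers are hypothetical, and the equal-distance speed example for the harmonic mean is an illustrative assumption.

```python
import statistics

# Hypothetical lengths of hospital stay (days); the 30-day stay is an outlier.
stay_days = [2, 3, 3, 4, 5, 6, 30]

mean_stay = statistics.mean(stay_days)      # 53 / 7, pulled up by the outlier
median_stay = statistics.median(stay_days)  # 4th of 7 ordered values
mode_stay = statistics.mode(stay_days)      # most frequent value

# Hypothetical reciprocal antibody titres from serial two-fold dilutions.
titres = [8, 16, 32, 64, 128]
gm_titre = statistics.geometric_mean(titres)

# Average speed over two equal distances driven at 40 and 60 km/h.
hm_speed = statistics.harmonic_mean([40, 60])

print(f"Mean {mean_stay:.2f}, median {median_stay}, mode {mode_stay}")
print(f"Geometric mean titre {gm_titre:.1f}, harmonic mean speed {hm_speed:.1f}")
```

The outlier lifts the mean to about 7.6 days while the median stays at 4, which is why the median is preferred for skewed data.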

High-yield: For income, hospital length-of-stay, serum bilirubin, incubation period and other right-skewed distributions, the median is the preferred measure of central tendency because the mean is dragged toward the long tail.

Empirical relationship in skewed data

For a moderately skewed unimodal distribution:

Mode = 3 × Median − 2 × Mean

This Karl Pearson relationship lets you compute the missing measure if two are known — a classic numerical.
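The relationship can be rearranged to recover whichever measure is missing. A minimal sketch follows; the function names are assumptions introduced here, and the values are invented for illustration.

```python
# Empirical relationship for moderately skewed data: Mode = 3*Median - 2*Mean.

def empirical_mode(mean: float, median: float) -> float:
    """Estimate the mode from the mean and median."""
    return 3 * median - 2 * mean

def empirical_median(mean: float, mode: float) -> float:
    """Rearranged: Median = (Mode + 2*Mean) / 3."""
    return (mode + 2 * mean) / 3

def empirical_mean(median: float, mode: float) -> float:
    """Rearranged: Mean = (3*Median - Mode) / 2."""
    return (3 * median - mode) / 2

# Hypothetical numerical: mean = 30, median = 28, find the mode.
print(empirical_mode(mean=30, median=28))    # 3*28 - 2*30 = 24
print(empirical_median(mean=30, mode=24))    # (24 + 60) / 3 = 28.0
print(empirical_mean(median=28, mode=24))    # (84 - 24) / 2 = 30.0
```

Here Mode (24) < Median (28) < Mean (30), the ordering expected in a positively skewed distribution.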

Measures of dispersion (spread)

Central tendency alone is incomplete; two datasets can share a mean yet differ wildly. Dispersion quantifies scatter.

Measure	Formula / basis	Comment
Range	Max − Min	Crudest; uses only 2 values; very outlier-sensitive
Interquartile range (IQR)	Q3 − Q1 (middle 50%)	Robust; pairs with median; used in box plots
Variance (s²)	Mean of squared deviations	Units are squared
Standard deviation (SD)	√variance	Most used; same units as data; pairs with mean
Coefficient of variation (CV)	(SD ÷ Mean) × 100	Unitless; compares variability across different units/scales

Key points:

Quartiles divide data into 4 parts (Q1 = 25th, Q2 = median = 50th, Q3 = 75th percentile). Deciles = 10 parts, percentiles = 100 parts.
Sample SD divides by (n − 1) — the degrees of freedom / Bessel's correction; population SD divides by N. Dividing by (n − 1) corrects the underestimation of variance in a sample.
CV is the tool of choice to compare the variability of two different variables (e.g., is weight more variable than height?) or the same variable in different units.
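The measures in the table can be computed with the standard `statistics` module. The weights and heights below are invented, and the quartile method is Python's default ("exclusive"); other textbooks use slightly different quartile conventions, so treat the IQR value as convention-dependent.

```python
import statistics

# Hypothetical measurements from five adults.
weights_kg = [50, 55, 60, 65, 70]
heights_cm = [150, 155, 160, 165, 170]

value_range = max(weights_kg) - min(weights_kg)        # Max - Min
sample_var = statistics.variance(weights_kg)           # divides by n - 1
population_var = statistics.pvariance(weights_kg)      # divides by N
sample_sd = statistics.stdev(weights_kg)               # square root of sample variance

q1, q2, q3 = statistics.quantiles(weights_kg, n=4)     # quartiles (default method)
iqr = q3 - q1

def coefficient_of_variation(values):
    """CV = (SD / Mean) * 100, a unitless percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)

print(f"Range {value_range}, variance {sample_var} (n-1) vs {population_var} (N)")
print(f"SD {sample_sd:.2f} kg, IQR {iqr}")
print(f"CV weight {cv_weight:.1f}% vs CV height {cv_height:.1f}%")
```

Both series have the same SD (about 7.9), yet weight has the larger CV because its mean is smaller, which is the kind of comparison the CV is designed for.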

High-yield: Standard deviation (SD) describes the spread of individual observations in one sample, whereas standard error of the mean (SEM = SD/√n) describes the precision of the sample mean as an estimate of the population mean. SEM is smaller than SD whenever n > 1 and decreases as n increases. SEM is inferential, not descriptive, but it is a favourite distractor.
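A quick numerical sketch of SEM = SD/√n, holding a hypothetical SD of 10 constant while the sample size grows:

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10  # hypothetical sample SD, held constant for illustration

for n in (4, 25, 100):
    # SEM shrinks as n grows, while the SD itself does not change.
    print(f"n = {n:>3}: SD = {sd}, SEM = {standard_error(sd, n):.1f}")
```

Quadrupling the sample size only halves the SEM, because n enters under a square root.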

The normal (Gaussian) distribution & the empirical rule

For a perfectly normal (bell-shaped, symmetrical) curve, Mean = Median = Mode, and:

Mean ± 1 SD covers ≈ 68% of observations
Mean ± 2 SD covers ≈ 95% (more precisely, ±1.96 SD covers 95%)
Mean ± 3 SD covers ≈ 99.7%

High-yield: The 95% reference range / normal range of a clinical parameter = Mean ± 1.96 SD (often rounded to 2 SD). This defines the cut-off between "normal" and "abnormal" lab values.
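The empirical-rule percentages can be verified with `statistics.NormalDist` from the standard library. The haemoglobin mean and SD used for the reference range are invented values for illustration only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k: float) -> float:
    """Proportion of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"Mean ± {k} SD covers {coverage(k) * 100:.2f}%")

def reference_range(mean: float, sd: float) -> tuple:
    """95% reference range: mean ± 1.96 SD (assumes a normal distribution)."""
    return (mean - 1.96 * sd, mean + 1.96 * sd)

# Hypothetical haemoglobin values: mean 14 g/dL, SD 1 g/dL.
low, high = reference_range(mean=14, sd=1)
print(f"95% reference range: {low:.2f} to {high:.2f} g/dL")
```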

Skewness — shape of the distribution

Skewness measures asymmetry. This is a near-guaranteed question.

Stepwise approach to identify skew: Locate the tail → the tail points toward the skew direction → the mean is pulled toward the tail → order the three measures accordingly.

Distribution	Tail direction	Order of measures	Typical example
Positive (right) skew	Long tail to the right	Mode < Median < Mean	Income, hospital stay, incubation period, serum triglycerides
Symmetric (normal)	None	Mean = Median = Mode	Height, BP in healthy adults
Negative (left) skew	Long tail to the left	Mean < Median < Mode	Age at death in developed countries, gestational age at delivery

High-yield: In positive skew, Mean > Median > Mode: taken in alphabetical order (mean, median, mode), the three measures decrease in value; in negative skew the order reverses. Of the three, the mean always lies closest to the tail. This single fact answers most skewness MCQs.
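The ordering rule can be seen on two small invented datasets. The function below is a rough heuristic that only compares mean and median, as the rule above suggests; it is not a formal skewness coefficient, and its name is an assumption introduced here.

```python
import statistics

def skew_direction(values) -> str:
    """Heuristic: the mean is pulled toward the tail, so compare it with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right) skew"
    if mean < median:
        return "negative (left) skew"
    return "symmetric"

# Hypothetical hospital stays (days): long tail to the right.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 7, 11]
# Hypothetical exam scores out of 11: long tail to the left.
left_tailed = [1, 5, 7, 8, 9, 9, 10, 10, 10, 11]

for data in (right_tailed, left_tailed):
    print(
        f"mode {statistics.mode(data)}, median {statistics.median(data)}, "
        f"mean {statistics.mean(data)}: {skew_direction(data)}"
    )
```

The first dataset gives Mode 2 < Median 3 < Mean 4; the second gives Mean 8 < Median 9 < Mode 10.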

Kurtosis describes "peakedness": leptokurtic (tall, thin), mesokurtic (normal), platykurtic (flat). Less commonly asked.

Graphical presentation of data

The examiner's favourite: match the data type to the correct diagram.

Diagrams for qualitative (categorical) data

Graph	Use	Key feature
Bar chart / bar diagram	Nominal or discrete categories	Bars are separated (gaps); height = frequency
Pie chart / sector diagram	Categorical data as proportions of a whole	Angle of each sector = (value/total) × 360°
Pictogram	Categorical, for lay audiences	Uses pictures/symbols
Map / spot map / cartogram	Geographical/spatial distribution	E.g., John Snow's cholera map
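The pie-chart sector formula, angle = (value/total) × 360°, is easy to work through. The blood-group counts below are invented for a class of 100 students.

```python
# Hypothetical blood-group counts in a class of 100 students.
blood_groups = {"A": 25, "B": 35, "AB": 10, "O": 30}

total = sum(blood_groups.values())

# Sector angle = (value / total) * 360, written as value * 360 / total.
sector_angles = {group: count * 360 / total
                 for group, count in blood_groups.items()}

for group, angle in sector_angles.items():
    print(f"Group {group}: {angle:.0f} degrees")

# The sectors of a pie chart must always add up to a full circle.
print("Sum of angles:", sum(sector_angles.values()))
```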

Diagrams for quantitative (continuous) data

Graph	Use	Key feature
Histogram	Frequency distribution of continuous data	Bars touch (no gaps); area ∝ frequency
Frequency polygon	Continuous data; compares ≥2 distributions	Line joining midpoints of class tops
Frequency curve	Smoothed frequency polygon	Large n, small class intervals
Ogive (cumulative frequency curve)	Reading median, quartiles, percentiles	S-shaped cumulative plot
Scatter diagram	Relationship between two quantitative variables	Shows correlation direction/strength
Stem-and-leaf plot	Small datasets; retains raw values	Acts as histogram + raw data
Box-and-whisker plot	Shows median, IQR, range, outliers	Box = IQR (Q1–Q3); line = median
Line / trend graph	Data over time (time-series)	E.g., epidemic curve trend

High-yield — Histogram vs Bar chart (the most repeated graph MCQ):

Histogram: continuous/quantitative data, bars touch, x-axis is a continuous scale, area represents frequency.

Bar chart: qualitative/discrete data, bars are separated, x-axis is categories, height represents frequency.

High-yield: A pie chart is best for showing component parts of a single whole (proportions). To compare components across multiple groups, a component (stacked) bar chart is better.

High-yield: Scatter diagram is the graph of choice to display correlation between two continuous variables (e.g., age vs blood pressure). The box plot is the graph of choice to display median and spread / detect outliers and to compare skewness between groups.
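The numbers behind a box plot can be sketched as follows. The data are invented, the quartiles use Python's default method, and the 1.5 × IQR fence for flagging outliers is a common plotting convention assumed here; it is not stated in these notes.

```python
import statistics

# Hypothetical lengths of hospital stay (days).
stay_days = sorted([2, 3, 3, 4, 5, 6, 30])

# Box = Q1 to Q3, line inside the box = median.
q1, median, q3 = statistics.quantiles(stay_days, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in stay_days if x < lower_fence or x > upper_fence]

print(f"Min {stay_days[0]}, Q1 {q1}, median {median}, Q3 {q3}, max {stay_days[-1]}")
print(f"IQR {iqr}, fences {lower_fence} to {upper_fence}, outliers {outliers}")
```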

Special epidemiological graphs

Epidemic curve: histogram of cases plotted against time of onset. A single sharp peak suggests a point-source (common-source) outbreak; successive peaks one incubation period apart suggest propagated (person-to-person) spread.
Arithmetic vs semi-logarithmic scale: a semilog (ratio) graph is preferred when showing rate of change / proportional change or when values span several orders of magnitude; equal ratios appear as equal distances. Arithmetic scale shows absolute change.
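The statement that equal ratios appear as equal distances on a semilog scale can be checked directly. The weekly case counts are hypothetical and double each week.

```python
import math

# Hypothetical weekly case counts that double every week.
weekly_cases = [10, 20, 40, 80]

pairs = list(zip(weekly_cases, weekly_cases[1:]))

# Arithmetic scale: vertical distance = absolute change, which keeps growing.
arithmetic_steps = [later - earlier for earlier, later in pairs]

# Logarithmic scale: vertical distance = difference of logs = log of the ratio.
log_steps = [math.log10(later) - math.log10(earlier) for earlier, later in pairs]

print("Arithmetic steps:", arithmetic_steps)
print("Log10 steps:     ", [round(step, 3) for step in log_steps])
```

The arithmetic steps are 10, 20 and 40, whereas every log step is about 0.301 (log₁₀ 2), so a constant rate of change plots as a straight line on the semilog graph.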

Choosing the right summary — flow

Step 1: Identify data type (nominal / ordinal / interval / ratio).
Step 2: If categorical → mode + bar/pie chart.
Step 3: If numerical → check the distribution shape (plot histogram/box plot).
Step 4a: If symmetric → report Mean ± SD.
Step 4b: If skewed/has outliers → report Median (IQR).
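The flow can be written as a small decision function. This is a sketch only: the function name and return strings are assumptions, and ordinal data are mapped to the median as in the NOIR table rather than to the mode.

```python
def choose_summary(scale: str, skewed: bool = False) -> str:
    """Suggest a summary for a variable, given its scale and distribution shape."""
    scale = scale.lower()
    if scale == "nominal":
        return "Mode; bar or pie chart"
    if scale == "ordinal":
        # Assumption: follows the NOIR table, where ordinal data use the median.
        return "Median; bar or pie chart"
    if scale in ("interval", "ratio"):
        # Numerical data: the choice depends on the shape of the distribution.
        return "Median (IQR)" if skewed else "Mean ± SD"
    raise ValueError(f"Unknown scale of measurement: {scale!r}")

print(choose_summary("nominal"))              # e.g. blood group
print(choose_summary("ratio"))                # e.g. height in healthy adults
print(choose_summary("ratio", skewed=True))   # e.g. income, hospital stay
```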

Common errors & differentials (concept traps)

Mean vs Median choice: Using the mean for skewed data (e.g., income) is the classic error — always prefer median there.
SD vs SEM: SD = data spread; SEM = precision of the mean. Reporting SEM to imply low variability is misleading.
Histogram vs Bar chart: decided by whether data are continuous (touching bars) or categorical (gaps).
Interval vs Ratio: decided by presence of a true zero; governs whether ratios/multiplication are meaningful.
Rate vs ratio vs proportion (and incidence vs prevalence): these are definitional traps. A proportion is a ratio whose numerator is part of the denominator; a rate additionally includes a time element.
Correlation ≠ causation: scatter/correlation shows association only.

Complications of poor data presentation

"Complications" is a clinical word; the analogous pitfalls in biostatistics are:

Misleading axes (truncated y-axis exaggerating differences).
Inappropriate chart (pie chart for time-series, 3-D charts distorting area).
Ecological fallacy when group-level data are applied to individuals.
Overlapping/unequal class intervals distorting a histogram.

Recently asked / exam angle

"Which graph for presenting blood groups in a class?" → Bar chart (or pie chart for proportions); blood group is nominal.
"Best measure of central tendency for serum bilirubin (right-skewed)?" → Median.
"Temperature in °C is which type of data?" → Interval (no true zero).
"In a positively skewed distribution, the relation is?" → Mean > Median > Mode.
"Mean ± 2 SD covers what % of normal distribution?" → ~95% (1.96 SD).
"Graph used to detect an outlier / compare medians?" → Box-and-whisker plot.
"Which is unaffected by extreme values — mean, median, mode, range?" → Median (and mode); mean & range are affected.
"Coefficient of variation is used to?" → Compare variability between two different units/series.
"Histogram is used for ___ data?" → Continuous (quantitative).
"Standard error of mean depends on?" → SD and sample size (1/√n).
"Epidemic curve with a single peak indicates?" → Point/common-source outbreak.
"Divides data into 100 equal parts?" → Percentiles.

Rapid revision

NOIR scale: Nominal → Ordinal → Interval → Ratio (rising information content).
True zero distinguishes Ratio (weight, BP, age) from Interval (°C, IQ, calendar year); Kelvin = ratio.
Mode → nominal, Median → ordinal/skewed, Mean → symmetric ratio data.
Mean is the most affected by outliers; median and mode are robust.
Positive (right) skew: Mean > Median > Mode; tail to the right (e.g., income, hospital stay).
Negative (left) skew: Mean < Median < Mode; tail to the left.
Normal curve: Mean = Median = Mode; 68 / 95 / 99.7% within 1 / 2 / 3 SD.
Normal reference range = Mean ± 1.96 SD (~95%).
SD = spread of observations; SEM = SD/√n = precision of mean (decreases with larger n).
CV = (SD/Mean) × 100 — unitless, compares variability across different scales.
Histogram (continuous, bars touch) vs Bar chart (categorical, bars separated); Pie chart for proportions of a whole.
Scatter plot → correlation between two continuous variables; Box plot → median, IQR, outliers; Ogive → percentiles/median.
