Screening & Diagnostic Test Evaluation

Community Medicine · Epidemiology · lean revision notes

The art of judging how good a test is. This is the single most calculation-heavy area of Community Medicine in NEET PG — expect at least one 2×2 table sum, one PPV-versus-prevalence concept question, and one cut-off/ROC question every year. Master the 2×2 table and half the chapter is done.

Screening vs Diagnostic Test — the core distinction

A screening test is applied to apparently healthy people to sort them into probably diseased vs probably disease-free. It is presumptive, not definitive. A diagnostic test is applied to those who screen positive (or who are symptomatic) to establish the actual diagnosis.

Feature | Screening test | Diagnostic test
Target population | Apparently healthy, asymptomatic | Symptomatic / screen-positive
Purpose | Detect probable disease early | Establish/confirm diagnosis
Basis | Single examination/criterion | Detailed work-up
Cost | Cheap, quick | Often expensive, time-consuming
Accuracy needed | High sensitivity preferred | High specificity / accuracy
Result | Not diagnostic; needs confirmation | Definitive (used to treat)
Initiated by | Investigator / programme | Patient with complaint

High-yield: A screening test should be highly sensitive (so it misses few true cases). The confirmatory test that follows should be highly specific (so few healthy people are wrongly labelled and treated).

Types of screening — mass (whole population), high-risk/selective (e.g. mammography in BRCA carriers), multiphasic (several tests at one sitting), and opportunistic (done when a patient attends for some other reason).

Validity vs Reliability

Two independent properties of a test:

  • Validity (accuracy): does the test measure what it is supposed to measure? Expressed by sensitivity and specificity. Validity is judged against a gold standard / reference standard (the test accepted as the true measure of disease).
  • Reliability (reproducibility / precision): does the test give the same result on repeated application? Components: inter-observer variation, intra-observer variation, and instrument variation. For categorical data, agreement is quantified by the kappa (κ) statistic.

High-yield: A test can be reliable but not valid (consistently wrong) — like a weighing scale always reading 5 kg high. Validity is about closeness to truth; reliability is about consistency.

Cohen's Kappa interpretation (agreement beyond chance):

Kappa value | Agreement
< 0 | Poor (less than chance)
0.01–0.20 | Slight
0.21–0.40 | Fair
0.41–0.60 | Moderate
0.61–0.80 | Substantial
0.81–1.00 | Almost perfect
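
The notes above give the interpretation bands but not the calculation. As a minimal sketch, kappa can be computed as (observed agreement − chance agreement) / (1 − chance agreement). The function name `cohen_kappa` and the counts below are hypothetical and chosen only for illustration.

```python
def cohen_kappa(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
    """Cohen's kappa for two observers rating the same subjects positive/negative."""
    n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
    observed = (both_pos + both_neg) / n          # proportion of subjects they agree on
    # Agreement expected by chance, from each observer's own positive rate
    a_pos = (both_pos + a_pos_b_neg) / n
    b_pos = (both_pos + a_neg_b_pos) / n
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (observed - expected) / (1 - expected)


# Hypothetical data: two readers report the same 100 X-rays
kappa = cohen_kappa(both_pos=40, a_pos_b_neg=10, a_neg_b_pos=5, both_neg=45)
print(round(kappa, 2))  # 0.7
```

Here the readers agree on 85% of films, but 50% agreement would be expected by chance alone, so kappa is 0.70 (substantial agreement on the scale above).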

The 2×2 table — the heart of the chapter

Always lay it out the same way: disease status in columns, test result in rows.

Test result | Disease Present | Disease Absent | Total
Test Positive | TP (a) | FP (b) | a + b
Test Negative | FN (c) | TN (d) | c + d
Total | a + c | b + d | N

Memorise the layout: a = true positive, b = false positive, c = false negative, d = true negative.

Sensitivity and Specificity (intrinsic to the test)

  • Sensitivity = ability to correctly identify the diseased = TP / (TP + FN) = a / (a + c). Read down the disease-present column.
  • Specificity = ability to correctly identify the non-diseased = TN / (TN + FP) = d / (b + d). Read down the disease-absent column.

Mnemonics:

  • SnNout: a highly Sensitive test, when Negative, rules out disease.
  • SpPin: a highly Specific test, when Positive, rules in disease.
  • PID: Positive In Disease → Sensitivity is the positivity in the diseased. NIH: Negative In Health → Specificity is the negativity in the healthy.

High-yield: Sensitivity and specificity are properties of the test itself — they do NOT change with disease prevalence (they may vary with disease spectrum/severity, but not with how common disease is in the population). This is the most repeated concept-trap in NEET PG.

Predictive values (depend on prevalence)

  • Positive Predictive Value (PPV) = probability that a test-positive person truly has disease = TP / (TP + FP) = a / (a + b). Read across the positive row.
  • Negative Predictive Value (NPV) = probability that a test-negative person is truly disease-free = TN / (TN + FN) = d / (c + d). Read across the negative row.

High-yield: PPV rises and NPV falls as prevalence increases. In a rare disease (low prevalence), even an excellent test gives a poor PPV because false positives swamp the few true positives. This is the favourite stem in the exam.

False positive rate = b / (b + d) = 1 − Specificity. False negative rate = c / (a + c) = 1 − Sensitivity. Accuracy = (TP + TN) / N = (a + d) / N.
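
The formulas above can be collected into one helper. This is only a sketch: the function name `two_by_two_indices` and the cell counts are hypothetical, and the cells follow the a, b, c, d layout of the table (a = TP, b = FP, c = FN, d = TN).

```python
def two_by_two_indices(a, b, c, d):
    """Validity and predictive indices from a 2x2 table (a=TP, b=FP, c=FN, d=TN)."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),          # read down the disease-present column
        "specificity": d / (b + d),          # read down the disease-absent column
        "ppv": a / (a + b),                  # read across the test-positive row
        "npv": d / (c + d),                  # read across the test-negative row
        "false_positive_rate": b / (b + d),  # 1 - specificity
        "false_negative_rate": c / (a + c),  # 1 - sensitivity
        "accuracy": (a + d) / n,
    }


# Hypothetical counts, chosen only to give round numbers
indices = two_by_two_indices(a=45, b=30, c=5, d=120)
for name, value in indices.items():
    print(f"{name}: {value:.1%}")
```

With these counts, sensitivity is 90%, specificity 80%, PPV 60%, NPV 96% and accuracy 82.5%.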

A worked numerical (do it exactly like the exam wants)

A test is applied to 1000 people; 100 truly have the disease. The test detects 80 of the 100 cases and is positive in 90 of the 900 healthy people.

Fill the table: TP = 80, FN = 20, FP = 90, TN = 810.

  • Sensitivity = 80 / 100 = 80%
  • Specificity = 810 / 900 = 90%
  • PPV = 80 / (80 + 90) = 80 / 170 = 47%
  • NPV = 810 / (810 + 20) = 810 / 830 = 97.6%
  • Accuracy = (80 + 810)/1000 = 89%

Now drop prevalence to 1% (10 diseased, 990 healthy) with the same sensitivity 80% and specificity 90%: TP = 8, FN = 2, FP = 99, TN = 891. PPV = 8 / (8 + 99) = 7.5% — collapsed, while sensitivity and specificity are unchanged. This single demonstration captures the most tested principle of the chapter.
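
The two scenarios above can be reproduced with a short sketch that rebuilds the expected 2×2 cells from prevalence, sensitivity and specificity. The function name `expected_table` is an illustrative choice, not something from the syllabus.

```python
def expected_table(n, prevalence, sensitivity, specificity):
    """Expected TP, FP, FN, TN when a test is applied to n people."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity   # diseased people correctly called positive
    fn = diseased - tp            # diseased people missed
    tn = healthy * specificity    # healthy people correctly called negative
    fp = healthy - tn             # healthy people wrongly called positive
    return tp, fp, fn, tn


# Same test (sensitivity 80%, specificity 90%) at 10% and then 1% prevalence
for prevalence in (0.10, 0.01):
    tp, fp, fn, tn = expected_table(1000, prevalence, 0.80, 0.90)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
# prevalence 10%: PPV 47.1%, NPV 97.6%
# prevalence 1%: PPV 7.5%, NPV 99.8%
```

Note that NPV moves the other way: it rises from 97.6% to 99.8% as the disease becomes rarer.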

Stepwise approach to any 2×2 question:

  1. Draw the table with disease in columns.
  2. Place the four cells (a, b, c, d).
  3. For sensitivity/specificity read vertically; for PPV/NPV read horizontally.
  4. Plug the numbers into the formula.

Columns → test validity; rows → predictive value.

Likelihood Ratios (LR) — prevalence-independent and powerful

  • LR positive (LR+) = Sensitivity / (1 − Specificity) = how much a positive result raises the odds of disease. LR+ = TP rate / FP rate.
  • LR negative (LR−) = (1 − Sensitivity) / Specificity = how much a negative result lowers the odds of disease.

Interpretation cut-offs:

LR+ | Effect on probability | LR− | Effect on probability
> 10 | Large, conclusive increase | < 0.1 | Large, conclusive decrease
5–10 | Moderate increase | 0.1–0.2 | Moderate decrease
2–5 | Small increase | 0.2–0.5 | Small decrease
1–2 | Minimal | 0.5–1 | Minimal

High-yield: Like sensitivity and specificity, LRs are independent of prevalence, which is why they are clinically preferred. Post-test odds = pre-test odds × LR (the basis of the Fagan nomogram).
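
A minimal sketch of the odds calculation, using a sensitivity of 80%, a specificity of 90% and a pre-test probability of 10% as illustrative inputs. The function names are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, lr):
    """Probability -> odds, multiply by the LR, then odds -> probability."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(0.80, 0.90)
p_after_pos = post_test_probability(0.10, lr_pos)
p_after_neg = post_test_probability(0.10, lr_neg)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")            # LR+ = 8.0, LR- = 0.22
print(f"after a positive result: {p_after_pos:.1%}")        # 47.1%
print(f"after a negative result: {p_after_neg:.1%}")        # 2.4%
```

When the pre-test probability is taken as the prevalence, the post-test probability after a positive result equals the PPV, and after a negative result it equals 1 − NPV. The LRs themselves do not change with prevalence; only the pre-test odds do.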

Choosing the cut-off — the sensitivity–specificity trade-off

Most tests give a continuous result (e.g. blood glucose). A cut-off point must be chosen to dichotomise into positive/negative. As you move the cut-off, sensitivity and specificity trade off in opposite directions:

Lower the cut-off → more people called positive → ↑ Sensitivity, ↓ Specificity (more FP).
Raise the cut-off → fewer called positive → ↑ Specificity, ↓ Sensitivity (more FN).

Choice depends on the cost of errors:

  • Set cut-off to favour sensitivity when missing a case is dangerous and confirmation is easy/safe (e.g. screening blood donors for HIV — never miss an infected unit).
  • Set cut-off to favour specificity when a false positive leads to harmful, expensive, or irreversible action (e.g. before major surgery, before toxic chemotherapy).

High-yield: When the diseased and non-diseased distributions overlap, you cannot maximise both sensitivity and specificity at once — there will always be FP and FN. Only if the two distributions are completely separate can a cut-off give 100% of both.
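
The trade-off can be seen by sliding a cut-off across two overlapping sets of values. The readings below are invented for illustration (think of a glucose-like measurement where higher values suggest disease); they are not real data.

```python
# Hypothetical measurements; the two groups overlap between 110 and 120
diseased = [110, 118, 126, 130, 140, 155]
healthy = [85, 90, 95, 100, 105, 120]


def sens_spec(cutoff):
    """Call the test positive when the value is at or above the cut-off."""
    sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
    specificity = sum(v < cutoff for v in healthy) / len(healthy)
    return sensitivity, specificity


for cutoff in (100, 115, 126):
    sensitivity, specificity = sens_spec(cutoff)
    print(f"cut-off {cutoff}: sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
# cut-off 100: sensitivity 100%, specificity 50%
# cut-off 115: sensitivity 83%, specificity 83%
# cut-off 126: sensitivity 67%, specificity 100%
```

Because the groups overlap, no cut-off in this toy data set reaches 100% for both indices at once.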

ROC curve and Area Under the Curve (AUC)

The Receiver Operating Characteristic (ROC) curve plots Sensitivity (true positive rate) on the Y-axis against (1 − Specificity) = false positive rate on the X-axis, for every possible cut-off.

  • The best overall cut-off is usually taken as the point on the curve nearest the top-left corner (high sensitivity and high specificity). The Youden index (J = Sensitivity + Specificity − 1) is a closely related criterion; its maximum usually falls at or near the same point.
  • The diagonal line (45°) represents a useless test (AUC = 0.5, no better than chance).
  • AUC summarises overall discriminatory power across all cut-offs:

AUC | Test quality
0.5 | Useless (chance)
0.7–0.8 | Acceptable
0.8–0.9 | Good / excellent
> 0.9 | Outstanding
1.0 | Perfect test

High-yield: ROC is the tool to compare two diagnostic tests — the test with the larger AUC is superior. The closer the curve bows towards the upper-left corner, the better the test.
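
As a rough sketch of how an ROC curve is built, the code below takes every observed value as a candidate cut-off, records (1 − specificity, sensitivity) at each, estimates the AUC by the trapezoidal rule, and picks the cut-off with the highest Youden index. The data and function names are hypothetical.

```python
# Hypothetical measurements; higher values suggest disease
diseased = [110, 118, 126, 130, 140, 155]
healthy = [85, 90, 95, 100, 105, 120]


def roc_points(diseased, healthy):
    """List of (false positive rate, true positive rate, cut-off), strictest cut-off first."""
    points = [(0.0, 0.0, None)]  # cut-off above every value: nobody is called positive
    for cutoff in sorted(set(diseased) | set(healthy), reverse=True):
        tpr = sum(v >= cutoff for v in diseased) / len(diseased)
        fpr = sum(v >= cutoff for v in healthy) / len(healthy)
        points.append((fpr, tpr, cutoff))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(diseased, healthy)
auc = auc_trapezoid(points)
# Youden index J = sensitivity + specificity - 1 = TPR - FPR
best = max(points[1:], key=lambda p: p[1] - p[0])
print(f"AUC = {auc:.3f}")                                  # AUC = 0.944
print(f"best cut-off = {best[2]}, J = {best[1] - best[0]:.2f}")  # best cut-off = 110, J = 0.83
```

In this toy data the cut-off with the highest Youden index is also the point closest to the top-left corner; with other data the two criteria can pick slightly different points.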

Improving test performance: series vs parallel testing

When two tests are combined:

Strategy | How called positive | Effect | Use
Series (sequential) | Positive only if BOTH tests positive | ↑ Specificity & PPV, ↓ Sensitivity | Confirm disease; reduce false positives (e.g. ELISA then Western blot for HIV)
Parallel (simultaneous) | Positive if EITHER test positive | ↑ Sensitivity & NPV, ↓ Specificity | Rule out disease; emergencies; rapid screening

Mnemonic: Parallel for Picking up (sensitivity); Series to be Sure (specificity).
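
The direction of these changes can be checked numerically. The sketch below assumes the two tests err independently of each other given true disease status (an assumption added here, which real test pairs often violate), and uses hypothetical sensitivities and specificities.

```python
def series_testing(se1, sp1, se2, sp2):
    """Positive only if BOTH tests are positive (assumes conditionally independent tests)."""
    sensitivity = se1 * se2                     # a case must be caught by both tests
    specificity = 1 - (1 - sp1) * (1 - sp2)     # a healthy person must be wrong on both
    return sensitivity, specificity


def parallel_testing(se1, sp1, se2, sp2):
    """Positive if EITHER test is positive (assumes conditionally independent tests)."""
    sensitivity = 1 - (1 - se1) * (1 - se2)     # a case is missed only if both tests miss it
    specificity = sp1 * sp2                     # a healthy person must be negative on both
    return sensitivity, specificity


# Hypothetical tests: A (Se 90%, Sp 80%) and B (Se 80%, Sp 95%)
series = series_testing(0.90, 0.80, 0.80, 0.95)
parallel = parallel_testing(0.90, 0.80, 0.80, 0.95)
print(f"series:   Se {series[0]:.0%}, Sp {series[1]:.0%}")      # series:   Se 72%, Sp 99%
print(f"parallel: Se {parallel[0]:.0%}, Sp {parallel[1]:.0%}")  # parallel: Se 98%, Sp 76%
```

Series testing pushes specificity above either single test at the cost of sensitivity; parallel testing does the reverse.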

Wilson & Jungner criteria for a valid screening programme (WHO, 1968)

These ten classic criteria define when a screening programme is justified. Very frequently asked as "which of the following is NOT a criterion."

  1. The condition should be an important health problem.
  2. There should be an accepted treatment for those detected.
  3. Facilities for diagnosis and treatment should be available.
  4. There should be a recognisable latent or early symptomatic stage.
  5. A suitable test or examination should exist (valid, reliable, acceptable).
  6. The test should be acceptable to the population.
  7. The natural history of the condition should be adequately understood.
  8. There should be an agreed policy on whom to treat as patients.
  9. The cost of case-finding should be economically balanced against expenditure on care as a whole.
  10. Case-finding should be a continuous process, not a once-and-for-all project.

High-yield: Two prerequisites of the disease for screening to make sense — it must have a long detectable preclinical phase (DPCP) and an effective early treatment that improves prognosis. Screening a disease with no effective treatment only adds anxiety and lead time, not benefit.

Biases that falsely inflate screening "success"

  • Lead-time bias: screening advances the time of diagnosis without truly prolonging life — survival appears longer only because the clock started earlier.
  • Length-time bias: screening preferentially detects slow-growing, indolent disease (which spends longer in the detectable preclinical phase), making screened cases look more survivable than they are.
  • Overdiagnosis bias: an extreme of length bias — detection of disease that would never have caused harm in the person's lifetime (e.g. some screen-detected prostate or thyroid cancers).
  • Selection / volunteer bias: health-conscious volunteers tend to be healthier anyway.

High-yield: The only unbiased way to prove a screening programme saves lives is a randomised controlled trial using disease-specific mortality (not survival time) as the outcome.
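
Lead-time bias reduces to simple arithmetic. In the sketch below the ages are hypothetical and the scenario assumes screening does not change the age at death at all.

```python
def survival_from_diagnosis(age_at_diagnosis, age_at_death):
    """Years lived after diagnosis."""
    return age_at_death - age_at_diagnosis


# Hypothetical patient whose age at death is NOT changed by screening
age_at_death = 70
age_at_clinical_diagnosis = 66   # diagnosed when symptoms appear
age_at_screen_diagnosis = 62     # same disease picked up earlier by screening

survival_unscreened = survival_from_diagnosis(age_at_clinical_diagnosis, age_at_death)
survival_screened = survival_from_diagnosis(age_at_screen_diagnosis, age_at_death)
lead_time = age_at_clinical_diagnosis - age_at_screen_diagnosis

print(survival_unscreened, survival_screened, lead_time)  # 4 8 4
```

Survival after diagnosis doubles from 4 to 8 years, yet the patient dies at the same age. The entire apparent gain equals the lead time, which is why mortality rather than survival time is the outcome to trust.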

Key differentials / commonly confused pairs

  • Sensitivity vs PPV: sensitivity asks "of the diseased, how many test positive?"; PPV asks "of the test-positives, how many are diseased?" Direction of reasoning is opposite (column vs row).
  • Incidence/Prevalence vs validity indices: prevalence changes PPV/NPV but not sensitivity/specificity/LR.
  • Validity vs reliability: accuracy vs reproducibility.
  • Yield vs validity: yield = number of previously unrecognised cases detected and brought to treatment; it depends on the sensitivity of the test, the prevalence of the disease, and how often the population is screened.

Recently asked / exam angle

  • Given a 2×2 table, calculate sensitivity, specificity, PPV, NPV — almost guaranteed. Watch the orientation; some stems put the table in non-standard order to trip you.
  • "Which index changes with prevalence?" Answer: PPV and NPV (sensitivity, specificity, LR do not).
  • Lowering the cut-off of a continuous test → increases sensitivity, decreases specificity.
  • ROC curve: Y-axis = sensitivity, X-axis = 1 − specificity; best test = largest AUC / curve closest to top-left.
  • Best tool to compare two diagnostic tests → ROC curve / AUC.
  • Series testing increases specificity; parallel testing increases sensitivity.
  • HIV testing algorithm (screening ELISA → confirmatory Western blot) tests the screening-then-confirmation and series-testing concepts together.
  • Wilson–Jungner: "which is NOT a criterion" type. Note that "low cost of the test" alone and "high prevalence" alone are distractors; the actual criterion is that the cost of case-finding is economically balanced against expenditure on care as a whole.
  • Lead-time vs length-time bias identification from a scenario.
  • LR+ formula = Sn / (1 − Sp); recognise LR+ > 10 and LR− < 0.1 as clinically conclusive.

Rapid revision

  1. Sensitivity = a/(a+c) = TP/(TP+FN) — read down the diseased column; SnNout rules out.
  2. Specificity = d/(b+d) = TN/(TN+FP) — SpPin rules in.
  3. PPV = a/(a+b); NPV = d/(c+d) — read across the rows; both depend on prevalence.
  4. As prevalence ↑ → PPV ↑ and NPV ↓; sensitivity & specificity unchanged.
  5. Screening test → high sensitivity; confirmatory test → high specificity.
  6. Lowering cut-off → ↑ sensitivity, ↓ specificity (more false positives).
  7. ROC: Y = sensitivity, X = (1 − specificity); best cut-off near top-left; bigger AUC = better test (0.5 = useless, 1.0 = perfect).
  8. LR+ = Sn/(1−Sp); LR− = (1−Sn)/Sp; both independent of prevalence; LR+ >10 and LR− <0.1 are conclusive.
  9. Series testing (both positive) ↑ specificity; parallel testing (either positive) ↑ sensitivity.
  10. Validity = accuracy (vs gold standard); reliability = reproducibility (Kappa for categorical agreement).
  11. Wilson–Jungner: important problem, detectable latent stage, accepted effective treatment, suitable acceptable test, facilities available, cost-effective, continuous process.
  12. Lead-time and length-time biases falsely inflate survival; only disease-specific mortality in an RCT proves screening benefit.