| Evidence Direct: Levels of Evidence |

|
There are several schemes for grading levels of evidence. Section 1 gives two summaries, Section 2 a diagrammatic representation, Section 3 the NHMRC example and Section 4 the CEBM (Oxford) example.
1. A summary of how evidence can be graded.
|
In simple terms, one way of looking at levels of evidence is as follows (Level I is the highest; the higher the level, the better the quality of the evidence, and the lower the level, the greater the risk of bias):
|
I. Strong evidence from at least one systematic review of multiple well-designed randomised controlled trials.
II. Strong evidence from at least one properly designed randomised controlled trial of appropriate size.
III. Evidence from well-designed trials such as pseudo-randomised or non-randomised trials, cohort studies, time series or matched case-controlled studies.
IV. Evidence from well-designed non-experimental studies from more than one centre or research group or from case reports.
V. Opinions of respected authorities, based on clinical evidence, descriptive studies or reports of expert committees. |
or...
- Category I: Evidence from at least one properly randomized controlled trial.
- Category II-1: Evidence from well-designed controlled trials without randomization.
- Category II-2: Evidence from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
- Category II-3: Evidence from multiple time series with or without intervention, or dramatic results in uncontrolled experiments such as the results of the introduction of penicillin treatment in the 1940s.
- Category III: Opinions of respected authorities, based on clinical experience, descriptive studies and case reports, or reports of expert committees.
[Source: Harris, R.P. et al. (2001). Current methods of the U.S. Preventive Services Task Force: a review of the process. American Journal of Preventive Medicine, April; 20 (3 Supplement): 21-35.]
2. A Diagrammatic representation of the levels
|
3. NHMRC levels of evidence*
| Level | Intervention¹ | Diagnostic accuracy² | Prognosis | Aetiology³ | Screening Intervention |
|---|---|---|---|---|---|
| I⁴ | A systematic review of level II studies | A systematic review of level II studies | A systematic review of level II studies | A systematic review of level II studies | A systematic review of level II studies |
| II | A randomised controlled trial | A study of test accuracy with: an independent, blinded comparison with a valid reference standard,⁵ among consecutive persons with a defined clinical presentation⁶ | A prospective cohort study⁷ | A prospective cohort study | A randomised controlled trial |
| III-1 | A pseudorandomised controlled trial (i.e. alternate allocation or some other method) | A study of test accuracy with: an independent, blinded comparison with a valid reference standard,⁵ among non-consecutive persons with a defined clinical presentation⁶ | All or none⁸ | All or none⁸ | A pseudorandomised controlled trial (i.e. alternate allocation or some other method) |
| III-2 | A comparative study with concurrent controls: ▪ Non-randomised, experimental trial⁹ ▪ Cohort study ▪ Case-control study ▪ Interrupted time series with a control group | A comparison with reference standard that does not meet the criteria required for Level II and III-1 evidence | Analysis of prognostic factors amongst persons in a single arm of a randomised controlled trial | A retrospective cohort study | A comparative study with concurrent controls: ▪ Non-randomised, experimental trial ▪ Cohort study ▪ Case-control study |
| III-3 | A comparative study without concurrent controls: ▪ Historical control study ▪ Two or more single arm study¹⁰ ▪ Interrupted time series without a parallel control group | Diagnostic case-control study⁶ | A retrospective cohort study | A case-control study | A comparative study without concurrent controls: ▪ Historical control study ▪ Two or more single arm study |
| IV | Case series with either post-test or pre-test/post-test outcomes | Study of diagnostic yield (no reference standard)¹¹ | Case series, or cohort study of persons at different stages of disease | A cross-sectional study or case series | Case series |

Explanatory notes
1 Definitions of these study designs are provided on pages 7-8 of How to use the evidence: assessment and application of scientific evidence (NHMRC 2000b).
2 The dimensions of evidence apply only to studies of diagnostic accuracy. To assess the effectiveness of a diagnostic test there also needs to be a consideration of the impact of the test on patient management and health outcomes (Medical Services Advisory Committee 2005, Sackett and Haynes 2002).
3 If it is possible and/or ethical to determine a causal relationship using experimental evidence, then the ‘Intervention’ hierarchy of evidence should be utilised. If it is only possible and/or ethical to determine a causal relationship using observational evidence (ie. cannot allocate groups to a potentially harmful exposure, such as nuclear radiation), then the ‘Aetiology’ hierarchy of evidence should be utilised.
4 A systematic review will only be assigned a level of evidence as high as the studies it contains, excepting where those studies are of level II evidence. Systematic reviews of level II evidence provide more data than the individual studies and any meta-analyses will increase the precision of the overall results, reducing the likelihood that the results are affected by chance. Systematic reviews of lower level evidence present results of likely poor internal validity and thus are rated on the likelihood that the results have been affected by bias, rather than whether the systematic review itself is of good quality. Systematic review quality should be assessed separately. A systematic review should consist of at least two studies. In systematic reviews that include different study designs, the overall level of evidence should relate to each individual outcome/result, as different studies (and study designs) might contribute to each different outcome.
5 The validity of the reference standard should be determined in the context of the disease under review. Criteria for determining the validity of the reference standard should be pre-specified. This can include the choice of the reference standard(s) and its timing in relation to the index test. The validity of the reference standard can be determined through quality appraisal of the study (Whiting et al 2003).
6 Well-designed population based case-control studies (eg. population based screening studies where test accuracy is assessed on all cases, with a random sample of controls) do capture a population with a representative spectrum of disease and thus fulfil the requirements for a valid assembly of patients. However, in some cases the population assembled is not representative of the use of the test in practice. In diagnostic case-control studies a selected sample of patients already known to have the disease are compared with a separate group of normal/healthy people known to be free of the disease. In this situation patients with borderline or mild expressions of the disease, and conditions mimicking the disease are excluded, which can lead to exaggeration of both sensitivity and specificity. This is called spectrum bias or spectrum effect because the spectrum of study participants will not be representative of patients seen in practice (Mulherin and Miller 2002).
7 At study inception the cohort is either non-diseased or all at the same stage of the disease. A randomised controlled trial with persons either non-diseased or at the same stage of the disease in both arms of the trial would also meet the criterion for this level of evidence.
8 All or none of the people with the risk factor(s) experience the outcome; and the data arises from an unselected or representative case series which provides an unbiased representation of the prognostic effect. For example, no smallpox develops in the absence of the specific virus; and clear proof of the causal link has come from the disappearance of smallpox after large-scale vaccination.
9 This also includes controlled before-and-after (pre-test/post-test) studies, as well as adjusted indirect comparisons (ie. utilise A vs B and B vs C, to determine A vs C with statistical adjustment for B).
10 Comparing single arm studies ie. case series from two studies. This would also include unadjusted indirect comparisons (ie. utilise A vs B and B vs C, to determine A vs C but where there is no statistical adjustment for B).
11 Studies of diagnostic yield provide the yield of diagnosed patients, as determined by an index test, without confirmation of the accuracy of this diagnosis by a reference standard. These may be the only alternative when there is no reliable reference standard.
Note A: Assessment of comparative harms/safety should occur according to the hierarchy presented for each of the research questions, with the proviso that this assessment occurs within the context of the topic being assessed. Some harms are rare and cannot feasibly be captured within randomised controlled trials; physical harms and psychological harms may need to be addressed by different study designs; harms from diagnostic testing include the likelihood of false positive and false negative results; harms from screening include the likelihood of false alarm and false reassurance results.
Note B: When a level of evidence is attributed in the text of a document, it should also be framed according to its corresponding research question eg. level II intervention evidence; level IV diagnostic evidence; level III-2 prognostic evidence.
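
The rule in explanatory note 4 (a systematic review is rated no higher than the studies it contains, except that a review of level II studies becomes level I, and a review needs at least two studies) can be sketched in code. This is only an illustration: the function name `systematic_review_level` is hypothetical, and the handling of mixed-level inputs (reporting the weakest level present) is an assumption that the notes do not state. Following note 4, the input should be the studies contributing to a single outcome.

```python
# NHMRC levels ordered from strongest to weakest, as in the table above.
NHMRC_ORDER = ["I", "II", "III-1", "III-2", "III-3", "IV"]


def systematic_review_level(study_levels):
    """Return a level for a systematic review of one outcome (illustrative sketch).

    study_levels: NHMRC levels of the included primary studies, e.g. ["II", "II"].
    """
    if len(study_levels) < 2:
        # Note 4: a systematic review should consist of at least two studies.
        raise ValueError("a systematic review should include at least two studies")
    for level in study_levels:
        # Level I is reserved for systematic reviews, so it is not a valid input here.
        if level not in NHMRC_ORDER[1:]:
            raise ValueError(f"not a primary-study level: {level!r}")
    if all(level == "II" for level in study_levels):
        # Note 4: only a review of level II studies is promoted to level I.
        return "I"
    # ASSUMPTION (not stated in the notes): for mixed inputs, report the
    # weakest level present, as the conservative reading.
    return max(study_levels, key=NHMRC_ORDER.index)
```

For example, `systematic_review_level(["III-2", "III-2"])` stays at `"III-2"`, reflecting the point in note 4 that reviews of lower-level evidence are rated on the likely bias of their contents, not on the quality of the review itself.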
|
4. Oxford (CEBM) levels of evidence*
| Level | Therapy / Prevention, Aetiology / Harm | Prognosis | Diagnosis | Differential diagnosis / symptom prevalence study | Economic and decision analyses |
|---|---|---|---|---|---|
| 1a | SR (with homogeneity*) of RCTs | SR (with homogeneity*) of inception cohort studies; CDR† validated in different populations | SR (with homogeneity*) of Level 1 diagnostic studies; CDR† with 1b studies from different clinical centres | SR (with homogeneity*) of prospective cohort studies | SR (with homogeneity*) of Level 1 economic studies |
| 1b | Individual RCT (with narrow Confidence Interval‡) | Individual inception cohort study with > 80% follow-up; CDR† validated in a single population | Validating** cohort study with good††† reference standards; or CDR† tested within one clinical centre | Prospective cohort study with good follow-up**** | Analysis based on clinically sensible costs or alternatives; systematic review(s) of the evidence; and including multi-way sensitivity analyses |
| 1c | All or none§ | All or none case-series | Absolute SpPins and SnNouts†† | All or none case-series | Absolute better-value or worse-value analyses†††† |
| 2a | SR (with homogeneity*) of cohort studies | SR (with homogeneity*) of either retrospective cohort studies or untreated control groups in RCTs | SR (with homogeneity*) of Level >2 diagnostic studies | SR (with homogeneity*) of 2b and better studies | SR (with homogeneity*) of Level >2 economic studies |
| 2b | Individual cohort study (including low quality RCT; e.g., <80% follow-up) | Retrospective cohort study or follow-up of untreated control patients in an RCT; Derivation of CDR† or validated on split-sample§§§ only | Exploratory** cohort study with good††† reference standards; CDR† after derivation, or validated only on split-sample§§§ or databases | Retrospective cohort study, or poor follow-up | Analysis based on clinically sensible costs or alternatives; limited review(s) of the evidence, or single studies; and including multi-way sensitivity analyses |
| 2c | "Outcomes" Research; Ecological studies | "Outcomes" Research | | Ecological studies | Audit or outcomes research |
| 3a | SR (with homogeneity*) of case-control studies | | SR (with homogeneity*) of 3b and better studies | SR (with homogeneity*) of 3b and better studies | SR (with homogeneity*) of 3b and better studies |
| 3b | Individual Case-Control Study | | Non-consecutive study; or without consistently applied reference standards | Non-consecutive cohort study, or very limited population | Analysis based on limited alternatives or costs, poor quality estimates of data, but including sensitivity analyses incorporating clinically sensible variations. |
| 4 | Case-series (and poor quality cohort and case-control studies§§) | Case-series (and poor quality prognostic cohort studies***) | Case-control study, poor or non-independent reference standard | Case-series or superseded reference standards | Analysis with no sensitivity analysis |
| 5 | Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" | Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" | Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" | Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" | Expert opinion without explicit critical appraisal, or based on economic theory or "first principles" |

Produced by Bob Phillips, Chris Ball, Dave Sackett, Doug Badenoch, Sharon Straus, Brian Haynes, Martin Dawes since November 1998. Updated by Jeremy Howick March 2009.
Notes
Users can add a minus-sign "-" to denote a level of evidence that fails to provide a conclusive answer because of:
- EITHER a single result with a wide Confidence Interval
- OR a Systematic Review with troublesome heterogeneity.
Such evidence is inconclusive, and therefore can only generate Grade D recommendations.
| * |
By homogeneity we mean a systematic review that is free of worrisome variations (heterogeneity) in the directions and degrees of results between individual studies. Not all systematic reviews with statistically significant heterogeneity need be worrisome, and not all worrisome heterogeneity need be statistically significant. As noted above, studies displaying worrisome heterogeneity should be tagged with a "-" at the end of their designated level. |
| † |
Clinical Decision Rule. (These are algorithms or scoring systems that lead to a prognostic estimation or a diagnostic category.) |
| ‡ |
See note above for advice on how to understand, rate and use trials or other studies with wide confidence intervals. |
| § |
Met when all patients died before the Rx became available, but some now survive on it; or when some patients died before the Rx became available, but none now die on it. |
| §§ |
By poor quality cohort study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both exposed and non-exposed individuals and/or failed to identify or appropriately control known confounders and/or failed to carry out a sufficiently long and complete follow-up of patients. By poor quality case-control study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both cases and controls and/or failed to identify or appropriately control known confounders. |
| §§§ |
Split-sample validation is achieved by collecting all the information in a single tranche, then artificially dividing this into "derivation" and "validation" samples. |
| †† |
An "Absolute SpPin" is a diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. An "Absolute SnNout" is a diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis. |
| ‡‡ |
Good, better, bad and worse refer to the comparisons between treatments in terms of their clinical risks and benefits. |
| ††† |
Good reference standards are independent of the test, and applied blindly or objectively to all patients. Poor reference standards are haphazardly applied, but still independent of the test. Use of a non-independent reference standard (where the 'test' is included in the 'reference', or where the 'testing' affects the 'reference') implies a level 4 study. |
| †††† |
Better-value treatments are clearly as good but cheaper, or better at the same or reduced cost. Worse-value treatments are as good and more expensive, or worse and equally or more expensive. |
| ** |
Validating studies test the quality of a specific diagnostic test, based on prior evidence. An exploratory study collects information and trawls the data (e.g. using a regression analysis) to find which factors are 'significant'. |
| *** |
By poor quality prognostic cohort study we mean one in which sampling was biased in favour of patients who already had the target outcome, or the measurement of outcomes was accomplished in <80% of study patients, or outcomes were determined in an unblinded, non-objective way, or there was no correction for confounding factors. |
| **** |
Good follow-up in a differential diagnosis study is >80%, with adequate time for alternative diagnoses to emerge (for example 1-6 months acute, 1-5 years chronic) |
Grades of Recommendation
| A |
consistent level 1 studies |
| B |
consistent level 2 or 3 studies or extrapolations from level 1 studies |
| C |
level 4 studies or extrapolations from level 2 or 3 studies |
| D |
level 5 evidence or troublingly inconsistent or inconclusive studies of any level |
"Extrapolations" are where data is used in a situation that has potentially clinically important differences from the original study situation.
|
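
The Grades of Recommendation, together with the minus-sign rule in the Notes, amount to a small lookup from an evidence level to a grade. The sketch below is one possible reading and not an official CEBM tool: the function name `grade_of_recommendation` and its parameters `extrapolated` and `consistent` are hypothetical, and the result for an extrapolation from level 4 evidence is an assumption, since the grades above do not cover that case.

```python
def grade_of_recommendation(level, extrapolated=False, consistent=True):
    """Map an Oxford (CEBM) level label such as "1b", "2a-" or "5" to a grade A-D.

    level:        level label; a trailing "-" marks inconclusive evidence.
    extrapolated: True if the data is applied to a clinically different situation.
    consistent:   False if the studies are troublingly inconsistent.
    """
    inconclusive = level.endswith("-")   # minus-sign tag from the Notes
    core = level.rstrip("-")
    if not core or core[0] not in {"1", "2", "3", "4", "5"}:
        raise ValueError(f"unrecognised level label: {level!r}")
    major = core[0]                       # "1b" -> "1"

    # Grade D: level 5, or inconsistent/inconclusive studies of any level.
    if inconclusive or not consistent or major == "5":
        return "D"
    if major == "1":
        return "B" if extrapolated else "A"
    if major in {"2", "3"}:
        return "C" if extrapolated else "B"
    # major == "4"
    # ASSUMPTION: extrapolations from level 4 are not covered by the grades
    # above; this sketch drops them one grade, to D.
    return "D" if extrapolated else "C"
```

For instance, `grade_of_recommendation("1b")` gives `"A"`, while `grade_of_recommendation("1b-")` gives `"D"`, because a wide confidence interval makes the result inconclusive regardless of study design.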