NICE process and methods
Appendix F Quality appraisal checklist – quantitative intervention studies
Public health interventions comprise a vast range of approaches, from the relatively simple through to complex national policy interventions. As a consequence, research questions about the effectiveness and efficacy of public health interventions will typically rely on quantitative evidence from a range of sources (see section 3.2). This will include evidence from small (experimental) randomised controlled trials through to large-scale observational studies (see appendix E for an algorithm outlining the range of experimental and observational quantitative study designs).
Rather than include an exhaustive list of critical appraisal tools for each individual study design, this checklist[12] is designed to be used for randomised controlled trials, case–control studies, cohort studies, controlled before-and-after studies and interrupted time series. It is based on the 'Graphical appraisal tool for epidemiological studies (GATE)', developed by Jackson et al. (2006), revised and tailored to be more suitable for public health interventions. It is anticipated that the majority of study designs used to determine the effect of an intervention on a (quantitative) outcome will be amenable to critical appraisal with this revised tool.
It enables a reviewer to appraise a study's internal and external validity after addressing the following key aspects of study design:
- characteristics of study participants
- definition of, and allocation to, intervention and control conditions
- outcomes assessed over different time periods
- methods of analyses.
GATE is intended to be used in an electronic (Excel) format that will facilitate both the sharing and storage of data, and through linkage with other documents, the compilation of research reports. Much of the guidance to support the completion of the critical appraisal form that is reproduced below also appears in 'pop-up' windows in the electronic version[13].
There are 5 sections of the revised GATE. Section 1 seeks to assess the key population criteria for determining the study's external validity – that is, the extent to which the findings of a study are generalisable beyond the confines of the study to the study's source population.
Sections 2 to 4 assess the key criteria for determining the study's internal validity – that is, making sure that the study has been carried out carefully, and that the outcomes are likely to be attributable to the intervention being assessed, rather than some other (often unidentified) factor. In an internally valid study, any differences observed between groups of patients allocated to receive different interventions may (apart from the possibility of random error) be attributed to the intervention under investigation. Biases are characteristics that are likely to make estimates of effect differ systematically from the truth. Each of the critical appraisal checklist questions covers an aspect of methodology that research has shown makes a significant difference to the conclusions of a study.
Checklist items are worded so that 1 of 5 responses is possible:
++: Indicates that for that particular aspect of study design, the study has been designed or conducted in such a way as to minimise the risk of bias.
+: Indicates that either the answer to the checklist question is not clear from the way the study is reported, or that the study may not have addressed all potential sources of bias for that particular aspect of study design.
−: Should be reserved for those aspects of the study design in which significant sources of bias may persist.
Not reported (NR): Should be reserved for those aspects in which the study under review fails to report how they have (or might have) been considered.
Not applicable (NA): Should be reserved for those study design aspects that are not applicable given the study design under review (for example, allocation concealment would not be applicable for case–control studies).
In addition, the reviewer is requested to complete in detail the comments section of the quality appraisal form so that the grade awarded for each study aspect is as transparent as possible.
Each study is then awarded an overall study quality grading for internal validity (IV) and a separate one for external validity (EV):
++ All or most of the checklist criteria have been fulfilled; where they have not been fulfilled, the conclusions are very unlikely to alter.
+ Some of the checklist criteria have been fulfilled; where they have not been fulfilled or not adequately described, the conclusions are unlikely to alter.
− Few or no checklist criteria have been fulfilled and the conclusions are likely or very likely to alter.
Checklist
Study identification: (Include full citation details)
Study design: Refer to the glossary of study designs (appendix D) and the algorithm for classifying experimental and observational study designs (appendix E) to best describe the paper's underpinning study design
Guidance topic:
Assessed by:
Section 1: Population
1.1 Is the source population or source area well described? Was the country (e.g. developed or non-developed, type of healthcare system), setting (primary schools, community centres etc.), location (urban, rural), population demographics etc. adequately described?
++ + − NR NA
Comments:
1.2 Is the eligible population or area representative of the source population or area? Was the recruitment of individuals, clusters or areas well defined (e.g. advertisement, birth register)? Was the eligible population representative of the source? Were important groups under-represented?
++ + − NR NA
Comments:
1.3 Do the selected participants or areas represent the eligible population or area? Was the method of selection of participants from the eligible population well described? What % of selected individuals or clusters agreed to participate? Were there any sources of bias? Were the inclusion or exclusion criteria explicit and appropriate?
++ + − NR NA
Comments:
Section 2: Method of allocation to intervention (or comparison)
2.1 Allocation to intervention (or comparison). How was selection bias minimised? Was allocation to exposure and comparison randomised? Was it truly random ++ or pseudo-randomised + (e.g. consecutive admissions)? If not randomised, was significant confounding likely (−) or not (+)? If a cross-over, was order of intervention randomised?
++ + − NR NA
Comments:
2.2 Were interventions (and comparisons) well described and appropriate? Were interventions and comparisons described in sufficient detail (i.e. enough for the study to be replicated)? Were comparisons appropriate (e.g. usual practice rather than no intervention)?
++ + − NR NA
Comments:
2.3 Was the allocation concealed? Could the person(s) determining allocation of participants or clusters to intervention or comparison groups have influenced the allocation? Adequate allocation concealment (++) would include centralised allocation or computerised allocation systems.
++ + − NR NA
Comments:
2.4 Were participants or investigators blind to exposure and comparison? Were participants and investigators (those delivering or assessing the intervention) kept blind to intervention allocation? (Triple or double blinding score ++.) If lack of blinding is likely to cause important bias, score −.
++ + − NR NA
Comments:
2.5 Was the exposure to the intervention and comparison adequate? Is reduced exposure to intervention or control related to the intervention (e.g. adverse effects leading to reduced compliance) or fidelity of implementation (e.g. reduced adherence to protocol)? Was lack of exposure sufficient to cause important bias?
++ + − NR NA
Comments:
2.6 Was contamination acceptably low? Did any in the comparison group receive the intervention or vice versa? If so, was it sufficient to cause important bias? If a cross-over trial, was there a sufficient wash-out period between interventions?
++ + − NR NA
Comments:
2.7 Were other interventions similar in both groups? Did either group receive additional interventions or have services provided in a different manner? Were the groups treated equally by researchers or other professionals? Was this sufficient to cause important bias?
++ + − NR NA
Comments:
2.8 Were all participants accounted for at study conclusion? Were those lost to follow-up (i.e. dropped or lost pre-, during or post-intervention) acceptably low (i.e. typically <20%)? Did the proportion dropped differ by group? For example, were drop-outs related to the adverse effects of the intervention?
++ + − NR NA
Comments:
2.9 Did the setting reflect usual UK practice? Did the setting in which the intervention or comparison was delivered differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) condition in a hospital rather than a community-based setting?
++ + − NR NA
Comments:
2.10 Did the intervention or control comparison reflect usual UK practice? Did the intervention or comparison differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) delivered by specialists rather than GPs? Were participants monitored more closely?
++ + − NR NA
Comments:
Section 3: Outcomes
3.1 Were outcome measures reliable? Were outcome measures subjective or objective (e.g. biochemically validated nicotine levels ++ vs self-reported smoking −)? How reliable were outcome measures (e.g. inter- or intra-rater reliability scores)? Was there any indication that measures had been validated (e.g. validated against a gold standard measure or assessed for content validity)?
++ + − NR NA
Comments:
3.2 Were all outcome measurements complete? Were all or most study participants who met the defined study outcome definitions likely to have been identified?
++ + − NR NA
Comments:
3.3 Were all important outcomes assessed? Were all important benefits and harms assessed? Was it possible to determine the overall balance of benefits and harms of the intervention versus comparison?
++ + − NR NA
Comments:
3.4 Were outcomes relevant? Where surrogate outcome measures were used, did they measure what they set out to measure? (e.g. a study to assess impact on physical activity assesses gym membership – a potentially objective outcome measure – but is it a reliable predictor of physical activity?)
++ + − NR NA
Comments:
3.5 Were there similar follow-up times in exposure and comparison groups? If groups are followed for different lengths of time, then more events are likely to occur in the group followed up for longer, distorting the comparison. Analyses can be adjusted to allow for differences in length of follow-up (e.g. using person-years).
++ + − NR NA
Comments:
3.6 Was follow-up time meaningful? Was follow-up long enough to assess long-term benefits or harms? Was it too long, e.g. participants lost to follow-up?
++ + − NR NA
Comments:
Section 4: Analyses
4.1 Were exposure and comparison groups similar at baseline? If not, were these adjusted? Were there any differences between groups in important confounders at baseline? If so, were these adjusted for in the analyses (e.g. multivariate analyses or stratification)? Were there likely to be any residual differences of relevance?
++ + − NR NA
Comments:
4.2 Was intention to treat (ITT) analysis conducted? Were all participants (including those that dropped out or did not fully complete the intervention course) analysed in the groups (i.e. intervention or comparison) to which they were originally allocated?
++ + − NR NA
Comments:
4.3 Was the study sufficiently powered to detect an intervention effect (if one exists)? A power of 0.8 (that is, it is likely to see an effect of a given size if one exists, 80% of the time) is the conventionally accepted standard. Is a power calculation presented? If not, what is the expected effect size? Is the sample size adequate?
++ + − NR NA
Comments:
4.4 Were the estimates of effect size given or calculable? Were effect estimates (e.g. relative risks, absolute risks) given or possible to calculate?
++ + − NR NA
Comments:
4.5 Were the analytical methods appropriate? Were important differences in follow-up time and likely confounders adjusted for? If a cluster design, were analyses of sample size (and power), and effect size performed on clusters (and not individuals)? Were subgroup analyses pre-specified?
++ + − NR NA
Comments:
4.6 Was the precision of intervention effects given or calculable? Were they meaningful? Were confidence intervals or p values for effect estimates given or possible to calculate? Were CIs wide or were they sufficiently precise to aid decision-making? If precision is lacking, is this because the study is under-powered?
++ + − NR NA
Comments:
Section 5: Summary
5.1 Are the study results internally valid (i.e. unbiased)? How well did the study minimise sources of bias (i.e. adjusting for potential confounders)? Were there significant flaws in the study design?
++ + −
Comments:
5.2 Are the findings generalisable to the source population (i.e. externally valid)? Are there sufficient details given about the study to determine if the findings are generalisable to the source population? Consider: participants, interventions and comparisons, outcomes, resource and policy implications.
++ + −
Comments:
Notes on the use of the quantitative studies checklist
The following sections outline the checklist questions, the prompts provided as pop-up boxes in the electronic version (highlighted in boxes) and additional guidance notes to aid the reviewer in assessing the study's internal and external validity.
Section 1:
This section seeks to assess the key population criteria for determining the study's external validity.
Although there are checklists for assessing the external validity of RCTs (with a particular focus on clinical interventions; see, for example, Rothwell 2005), there do not appear to be any checklists specific to public health interventions.
The questions asked in this section ask the reviewer to identify and describe the source population of the study (that is, those the study aims to represent), the eligible population (those that meet the study eligibility criteria), and the study participants (those that agreed to participate in the study). Where a study assesses an intervention delivered to a particular geographical setting or area (rather than delivered to individuals), the questions in this section relate to describing the source area or setting, and how the study areas or settings were chosen. For example, a study might assess the effect on health outcomes of neighbourhood renewal schemes and this section seeks to identify and describe how those neighbourhoods were chosen and whether they are representative of the neighbourhoods the study seeks to represent.
External validity is defined as the extent to which the findings of a study are generalisable beyond the confines of the study itself to the source population. So, for example, findings from a study conducted in a school setting in the USA might be generalisable to other schools in the USA (the source population of the study). An assessment of external validity will consider how representative of the source population the study population is and whether or not there are any specific population, demographic or geographic features of the selected population that might limit or support generalisability. Also important are considerations of the setting, intervention and outcomes assessed. These factors will be considered in sections 2 and 3 of the checklist.
1.1 Is the source population or source area well described?
Was the source population or area described in sufficient detail? For example, country (developed or non-developed, type of healthcare system), setting (for example, primary school, community centre), location (urban, rural) and population demographics.
This question seeks to determine the study's source population or area (that is, to whom or what the study aims to represent). The source population is usually best identified by referring to the study's original research question.
It is important to consider those population demographic characteristics such as age, sex, sexual orientation, disability, ethnicity, religion, place of residence, occupation, education, socioeconomic position and social capital[14] that can help to assess the impact of interventions on health inequalities and may help guide recommendations for specific population subgroups.
1.2 Is the eligible population or area representative of the source population or area?
Was the recruitment of individuals, clusters or areas well defined (for example, advertisement, birth register, class list, area)? Was the eligible population or area representative of the source or were important groups under-represented?
To determine if the eligible population or area (for example, smokers responding to a media advertisement, areas of high density housing in a particular catchment area) are representative of the source population (for example, smokers or areas of high density housing), consider the means by which the eligible population was defined or identified and the implicit or explicit inclusion and exclusion criteria used. Were important groups likely to have been missed or under-represented? For example, were recruitment strategies geared toward more affluent or motivated groups? (For example, recruitment from more affluent areas or local fitness centres.) Were significant numbers of potentially eligible participants likely to have been inadvertently excluded? (For example, through referral to practitioners not involved in the research study.)
1.3 Do the selected participants or areas represent the eligible population or area?
Was the method of selection of participants from the eligible population well described? What percentage of selected individuals or clusters agreed to participate? Were there any sources of bias? Were the inclusion or exclusion criteria explicit and appropriate?
Consider whether the method of selection of participants or areas from the eligible population or area was well described (for example, consecutive cases or random sampling). Were any significant sources of bias likely to have been introduced? Consider what proportion of selected individuals or clusters agreed to participate. Was there a bias toward healthier or more motivated individuals, or wealthier areas?
Also consider whether the inclusion and exclusion criteria were well described and whether they were appropriate given the study objectives and the source population. Strict eligibility criteria can limit the external validity of intervention studies if the selected participants are not representative of the eligible population. This has been well-documented for RCTs where recruited participants have been found to differ from those who are eligible but not recruited, in terms of age, sex, race, severity of disease, educational status, social class and place of residence (Rothwell 2005).
Finally, consider whether sufficient detail of the demographic (for example, age, education, socioeconomic status, employment) or personal health-related (for example, smoking, physical activity levels) characteristics of the selected participants was presented. Are the selected participants representative of the eligible population?
Section 2: method of allocation to intervention (or comparison)
This section aims to assess the likelihood of selection bias and confounding being introduced into a study.
Selection bias exists when there are systematic differences between the participants in the different intervention groups. As a result, the differences in the outcome observed may be explained by pre-existing differences between the groups, rather than because of the intervention itself. For example, if the people in 1 group are generally in poorer health compared with the second group, then they are more likely to have a worse outcome, regardless of the effect of the intervention. The intervention groups should be similar at the start of the study so that the only difference between the groups should be the intervention received.
2.1 Allocation to intervention or comparison. How was confounding minimised?
Was allocation to exposure and comparison randomised? Was it truly random ++ or pseudo-randomised + (for example, consecutive admissions)? If not randomised, was significant confounding likely (−) or not (+)? If a crossover, was order of intervention randomised?
Consider the method by which individuals were allocated to either intervention or control conditions. Random allocation of individuals (as in RCTs) to receive 1 or other of the interventions under investigation, is considered the most reliable means of minimising the risk of selection bias and confounding.
If an appropriate method of randomisation has been used, each participant should have an equal chance of ending up in each of the intervention groups. Examples of random allocation sequences include random numbers generated by computer, tables of random numbers and drawing of lots or envelopes. However, if the description of randomisation is poor, or the process used is not truly random (for example, if the allocation sequence is predictable, such as date of birth or alternating between 1 group and another) or can otherwise be seen as flawed, this component should be given a lower quality rating.
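By way of illustration only (this is not part of the GATE tool), a truly random allocation sequence of the kind described above can be generated by computer in a few lines. The block size and group labels below are arbitrary assumptions made for this sketch:

```python
# Sketch of computer-generated block randomisation (illustrative only).
# Block size and group labels are arbitrary assumptions for this example.
import random

def block_randomisation(n_participants, block_size=4, seed=None):
    """Return an allocation sequence that balances the two groups within
    each block, so group sizes cannot drift far apart."""
    rng = random.Random(seed)
    sequence = []
    half = block_size // 2
    while len(sequence) < n_participants:
        block = ["intervention"] * half + ["control"] * half
        rng.shuffle(block)  # every ordering within a block is equally likely
        sequence.extend(block)
    return sequence[:n_participants]

print(block_randomisation(10, seed=42))
```

Because the sequence comes from a pseudo-random number generator rather than from anything predictable about the participants (such as date of birth or order of admission), each participant has an equal chance of entering either group.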
2.2 Were the interventions (and comparisons) well-described and appropriate?
Were interventions and comparisons described in sufficient detail (that is, enough for the study to be replicated)? Were comparisons appropriate (for example, usual practice rather than no treatment)?
2.3 Was the allocation concealed?
Could the person(s) determining the allocation of participants or clusters to intervention or comparison groups have influenced the allocation? Adequate allocation concealment (++) would include centralised allocation or computerised allocation systems.
If investigators are aware of the allocation group for the next individual to be enrolled in the study, there is potential for people to be enrolled in an order that results in imbalances in important characteristics. For example, a practitioner might feel that people with mild rather than severe mental health problems would be more likely to do better on a new, behavioural intervention and be tempted to only enrol such individuals when they know they will be allocated to that group. This would result in the intervention group being, on average, less severe at baseline than the control group. Concealment of treatment group may not always be feasible, but concealment of allocation up until the point of enrolment in the study should always be possible.
Information should be presented in the paper that provides some assurance that allocations were not known until at least the point of allocation. Centralised allocation, computerised allocation systems and the use of coded identical containers would all be regarded as adequate methods of concealment. Sealed envelopes can be considered as adequate concealment if the envelopes are serially numbered, sealed and opaque, and allocation is performed by a third party. Poor methods of allocation concealment include alternation, or the use of case record numbers, date of birth or day of the week.
If the method of allocation concealment used is regarded as poor, or relatively easy to subvert, the study should be given a lower quality rating. If a study does not report any concealment approach, this should be scored as 'not reported'.
2.4 Were participants and investigators blind to exposure and comparison?
Were participants AND investigators (those delivering or assessing the intervention) kept blind to intervention allocation? (Triple or double-blinding score ++.) If lack of blinding is likely to cause important bias, score −.
Blinding refers to the process of withholding information about treatment allocation or exposure status from those involved in the study who could potentially be influenced by this information. This can include participants, investigators, those administering care and those involved in data collection and analysis.
Unblinded individuals can bias the results of studies, either intentionally or unintentionally, through the use of other effective co-interventions, decisions about withdrawal, differential reporting of symptoms, or influencing concordance with treatment.
The terms 'single blind', 'double blind' and even 'triple blind' are sometimes used in studies. Unfortunately, they are not always used consistently. Commonly, when a study is described as 'single blind', only the participants are blind to their group allocation. When both participants and investigators are blind to group allocation the study is often described as 'double blind'. It is preferable to record exactly who was blinded, if reported, to avoid misunderstanding.
Blinding of participants and researchers is not always possible, so it is important to consider the likely size and direction of any bias caused by failure to blind when making an assessment of this component.
2.5 Is the exposure to the intervention and comparison adequate?
Is reduced exposure to the intervention or control related to the intervention (for example, adverse effects leading to reduced compliance) or fidelity of implementation (for example, reduced adherence to protocol)? Was lack of exposure sufficient to cause important bias?
2.6 Is contamination acceptably low?
Did any in the comparison group receive the intervention or vice versa? If so, was it sufficient to cause important bias? If a crossover trial, was there a sufficient wash-out period between interventions?
2.7 Were other interventions similar in both groups?
Did either group receive additional interventions or have services provided in a different manner? Were the groups treated equally by researchers or other professionals? Was this sufficient to cause important bias?
This question seeks to establish if there were any important differences between the intervention groups aside from the intervention received. If some patients received additional intervention (known as 'co-intervention'), this additional intervention is a potential confounding factor, the presence of which can make it difficult to attribute any observed effect to the intervention rather than to the other factors.
2.8 Were there other confounding factors?
Were there likely to be other confounding factors not considered or appropriately adjusted for? Was this sufficient to cause important bias?
2.9 Were all participants accounted for at study conclusion?
Were those lost to follow-up (that is, dropped or lost pre-, during or post-intervention) acceptably low (that is, typically less than 20%)? Did the proportion dropped differ by group? For example, were drop-outs related to the adverse effects of intervention?
Section 2 also aims to assess the likelihood of attrition bias being introduced into a study.
Attrition bias occurs when there are systematic differences between the comparison groups with respect to participants lost, or differences between participants lost to the study and those who remain. Attrition can occur at any point after participants have been allocated to their intervention groups. As such, it includes participants who are excluded post-allocation (and may indicate a violation of eligibility criteria), those who fail to complete the intervention and those who fail to complete outcome measurement (regardless of whether or not the intervention was completed).
It is a concern if the number of participants who were lost to follow-up (that is, dropped out) is high – typically >20%, although it is not unreasonable to expect a higher drop-out rate in studies conducted over a longer period of time.
Consideration should also be given to the reasons why participants dropped out. Participants who dropped out of a study may differ in some significant way from those who remained in the study. Drop-out rates and reasons for dropping out should be similar across all treatment groups. In good quality studies, the proportion of participants lost after allocation is reported and the possibility of attrition bias considered in the analysis.
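As a minimal sketch of how the 20% rule of thumb above might be applied, per group, to reported numbers (the counts here are invented for illustration):

```python
# Invented counts: check attrition against the ~20% rule of thumb,
# separately for each group, since differential drop-out also matters.
randomised = {"intervention": 150, "control": 150}
completed = {"intervention": 112, "control": 131}

for group, n in randomised.items():
    lost = n - completed[group]
    rate = lost / n
    flag = "possible concern" if rate > 0.20 else "acceptable"
    print(f"{group}: {lost}/{n} lost to follow-up ({rate:.0%}) -> {flag}")
```

Here the intervention arm loses 25% of participants against 13% in the control arm, so both the overall level of attrition and the difference between arms would warrant comment on the appraisal form.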
2.10 Did the setting reflect usual UK practice?
Did the setting in which the intervention or comparison was delivered differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) condition in a hospital rather than a community-based setting?
2.11 Did the intervention or control comparison reflect usual UK practice?
Did the intervention or comparison differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) delivered by specialists rather than GPs? Were participants monitored more closely?
Section 3: outcomes
Some of the items on this checklist may need to be filled in separately for each of the different outcomes reported by the study. For example, a study may report only 1 outcome of interest, measured by 1 tool, at 1 point in time, in which case each of the components (for example, reliability of outcome measure, relevance, withdrawals and drop-outs) can be assessed based on that 1 tool. However, if a study reports multiple outcomes of interest, scored by multiple tools (for example, self-report AND biochemically validated measures), at multiple points in time (for example, 6-month follow-up AND 1-year follow-up) individual components will need to be assessed for each outcome of interest.
It is important, therefore, that the reviewer has a clear idea of what the important outcomes are and over what timeframe, before appraising a study. The important outcomes for a piece of guidance will be identified through consultation with the NICE project team, the public health advisory committee and stakeholders.
3.1 Were the outcome measures reliable?
Were outcome measures subjective or objective (e.g. biochemically validated nicotine levels ++ versus self-reported smoking −)? How reliable were outcome measures (e.g. inter- or intra-rater reliability scores)? Was there any indication that measures had been validated (e.g. validated against a gold standard measure or assessed for content validity)?
This question seeks to determine how reliable (that is, how consistently the method measures a particular outcome) and valid (that is, the method measures what it claims to measure) the outcome measures were. For example, a study assessing effectiveness of a smoking cessation intervention may report on a number of outcomes using a number of different tools, including self-reported smoking rates (a subjective outcome measure that is often unreliable) and biochemically validated smoking rates (an objective outcome measure that is likely to be more reliable).
If the outcome measures were subjective, it is also important to consider whether the participant or researcher was blinded to the intervention or exposure (see question 2.4), as blinding may preserve the reliability of some subjective outcome measures.
3.2 Were the outcome measurements complete?
Were all or most study participants who met the defined study outcome definitions likely to have been identified?
3.3 Were all important outcomes assessed?
Were all important benefits and harms assessed? Was it possible to determine the overall balance of benefits and harms of the intervention versus comparison?
3.4 Were outcomes relevant?
Where surrogate outcome measures were used, did they measure what they set out to measure? For example, a study to assess impact on physical activity assesses gym membership – a potentially objective outcome measure – but is it a reliable predictor of physical activity?
3.5 Were there similar follow-up times in exposure and comparison groups?
If groups are followed for different lengths of time, then more events are likely to occur in the group followed up for longer, distorting the comparison. Analyses can be adjusted to allow for differences in length of follow-up (for example, using person-years).
It is possible to overcome differences in the length of follow-up between groups in the analyses, for example, by adjusting the denominator to take the time into account (by using person-years).
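For example (with invented numbers), expressing events per person-year puts groups with unequal follow-up on a common denominator:

```python
# Invented example: unequal follow-up distorts crude event counts,
# but rates per person-year use a common denominator.
groups = {
    "intervention": {"events": 30, "person_years": 400.0},  # ~2 years each
    "comparison":   {"events": 24, "person_years": 200.0},  # ~1 year each
}
for name, g in groups.items():
    rate = g["events"] / g["person_years"]
    print(f"{name}: {rate:.3f} events per person-year")
# 0.075 vs 0.120: the crude counts (30 vs 24 events) would have
# suggested the opposite ranking of the two groups.
```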
Section 4: analyses
4.1 Were the exposure and comparison groups similar at baseline? If not, were these adjusted?
Were there any differences between groups in important confounders at baseline? If so, were these adjusted for in the analyses (for example, multivariate analyses or stratification)? Were there likely to be any residual differences of relevance?
Studies may report the distributions or important differences in potential confounding factors between intervention groups. However, formal tests comparing the groups are problematic – failure to detect a difference does not mean a difference does not exist, and multiple comparisons of factors may falsely detect some differences that are not real.
It is important to assess whether all likely confounders have been considered. Confounding factors may differ by outcome, so potential confounding factors for all of the outcomes that are of interest will need to be considered.
4.2 Was intention to treat (ITT) analysis conducted?
Were all participants (including those that dropped out or did not fully complete the intervention course) analysed in the groups (that is, intervention or comparison) to which they were originally allocated?
4.3 Was the study sufficiently powered to detect an intervention effect (if one exists)?
A power of 0.8 (that is, it is likely to see an effect of a given size if one exists, 80% of the time) is the conventionally accepted standard. Is a power calculation presented? If not, what is the expected effect size? Is the sample size adequate?
For cluster RCTs in particular, it is important to consider whether the cluster design has been appropriately taken into account in calculating required sample size for adequate power.
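A hedged sketch of the conventional normal-approximation sample-size calculation for comparing two proportions at 80% power, together with the cluster-design inflation just mentioned; the baseline rate, effect size, intra-class correlation coefficient (ICC) and cluster size below are all invented for illustration:

```python
# Sketch: required sample size per arm for comparing two proportions
# (normal approximation, two-sided alpha = 0.05, power = 0.80), then
# inflated by the design effect DEFF = 1 + (m - 1) * ICC for clusters.
# All input values are invented for illustration.
from math import ceil, sqrt

def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

n = n_per_arm(p1=0.15, p2=0.25)   # e.g. quit rates of 15% vs 25%
deff = 1 + (20 - 1) * 0.02        # clusters of 20 people, ICC = 0.02
print(n, ceil(n * deff))          # 250 per arm; 345 per arm after inflation
```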
4.4 Were estimates of effect size given or calculable?
Were effect estimates (for example, relative risks, absolute risks) given or possible to calculate?
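For instance, when a study reports only raw counts, relative and absolute effect estimates can usually be recovered from the 2×2 table; the counts below are invented:

```python
# Invented 2x2 table: events and totals in each group.
events_int, n_int = 30, 200   # intervention group
events_con, n_con = 50, 200   # comparison group

risk_int = events_int / n_int            # 0.15
risk_con = events_con / n_con            # 0.25
relative_risk = risk_int / risk_con      # 0.60
risk_difference = risk_int - risk_con    # -0.10, i.e. 10 percentage points fewer events
print(f"RR = {relative_risk:.2f}, risk difference = {risk_difference:+.2f}")
```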
4.5 Were the analytical methods appropriate?
Were important differences in follow-up time, and likely confounders, adjusted for? If a cluster design, were analyses of sample size (and power), and effect size performed on clusters (and not individuals)? Were subgroup analyses pre-specified?
There are a large number of considerations in deciding whether analytical methods were appropriate. For example, it is important to review the appropriateness of any subgroup analyses (and whether pre-specified or exploratory) that are presented. Although subgroup analyses can often provide valuable information on which to base further research (that is, are often exploratory), it is important that findings of subgroup analyses are not over (or under) emphasised. Meaningful results from subgroup analyses are beset by the problems of multiplicity of testing (in which the risk of a false positive result increases with the number of tests performed) and low statistical power (that is, studies generally only enrol sufficient participants to ensure that testing the primary study hypothesis is adequately powered) (Assmann et al. 2000). In a good quality paper, subgroup analyses are restricted to pre-specified subgroups and are often confined to primary outcome measures. Data are analysed using formal statistical tests of interaction (that assess whether intervention effect differs between subgroups) rather than comparison of subgroup p values. A correction for multiple testing is performed where appropriate (for example, 'Bonferroni correction' where a stricter significance level is used to define statistical significance). The results are delineated carefully, and full details of how analyses were performed are provided (Assmann et al. 2000; Guillemin 2007).
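As a concrete instance of the Bonferroni correction mentioned above (all numbers invented): with 5 subgroup tests and an overall significance level of 0.05, each individual test must meet the stricter threshold of 0.05/5 = 0.01.

```python
# Bonferroni correction: divide the overall alpha by the number of tests
# so the family-wise false-positive risk stays near the nominal level.
alpha, n_tests = 0.05, 5                        # invented: 5 subgroup analyses
alpha_adjusted = alpha / n_tests                # 0.01
p_values = [0.030, 0.008, 0.200, 0.040, 0.600]  # invented subgroup p values
survive = [p for p in p_values if p < alpha_adjusted]
print(alpha_adjusted, survive)  # only p = 0.008 remains significant
```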
The appropriateness of some analytical methods will also depend on the study design under investigation. For example, with cluster RCTs, because participants are randomised at the group level and are not independent 'units' (as is the case with RCTs based on individuals without clustering), and outcomes are often assessed at the individual level, statistical adjustments are necessary before pooled intervention and control group outcomes can be compared.
Likewise, it is also important to consider whether the degree of similarity or difference within clusters has been considered in analyses of cluster RCTs. Good quality cluster RCTs will determine the intra-class correlation coefficient of their study (a statistical measure of the interdependence of observations within each cluster, calculated as the ratio of the between-cluster variance to the total variance).
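A minimal sketch of the intra-class correlation coefficient as defined above, using invented variance components:

```python
# ICC = between-cluster variance / total variance.
# The variance components below are invented for illustration.
var_between, var_within = 0.4, 7.6
icc = var_between / (var_between + var_within)
print(f"ICC = {icc:.2f}")  # 0.05
# Even a small ICC can inflate the required sample size appreciably in a
# cluster design (see the design-effect sketch under question 4.3 above).
```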
Studies may also report other forms of statistical analysis such as regression, time series, factor analysis and discriminant analysis, as well as epidemiological or economic modelling. Economic modelling is covered in more detail in chapter 6. The other topics are specialised and advice should be sought from the NICE project team before attempting to assess such studies.
4.6 Was the precision of intervention effects given or calculable? Were they meaningful?
Were confidence intervals or p values for effect estimates given or possible to calculate? Were confidence intervals wide or were they sufficiently precise to aid decision-making? If precision is lacking, is this because the study is under-powered?
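Continuing the invented 2×2 example given under question 4.4, a 95% confidence interval for a relative risk can be calculated on the log scale (the standard Katz method):

```python
# 95% CI for a relative risk via the log method (Katz), using the
# invented counts from the question 4.4 example above.
from math import exp, log, sqrt

events_int, n_int = 30, 200
events_con, n_con = 50, 200
rr = (events_int / n_int) / (events_con / n_con)                   # 0.60
se_log_rr = sqrt(1/events_int - 1/n_int + 1/events_con - 1/n_con)
lo, hi = exp(log(rr) - 1.96 * se_log_rr), exp(log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")  # 0.40 to 0.90
```

An interval of 0.40 to 0.90 excludes 1, so in this invented example the estimate is precise enough to aid decision-making.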
Section 5: summary
5.1 Are the study results internally valid (that is, unbiased)?
How well did the study minimise sources of bias (that is, any factor that skews the data in 1 particular direction) and confounding? Were there significant flaws in the study design?
5.2 Are the findings generalisable to the source population (that is, externally valid)?
Are there sufficient details given about the study to determine if the findings are generalisable to the source population? Consider: participants, interventions and comparisons, outcomes, resource and policy implications.
References and further reading
Assmann SF, Pocock SJ, Enos LE et al. (2000) Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 355: 1064–9
Effective Public Health Practice Project. Quality assessment tool for quantitative studies
Gough D, Oliver S, Thomas J, editors (2012) An introduction to systematic reviews. London: Sage
Guillemin F (2007) Primer: the fallacy of subgroup analysis. Nature Clinical Practice Rheumatology 3: 407–13
Heller RF, Verma A, Gemmell I et al. (2008) Critical appraisal for public health: a new checklist. Public Health 122: 92–8
Jackson R, Ameratunga S, Broad J et al. (2006) The GATE frame: critical appraisal with pictures. Evidence-Based Medicine 11: 35–8
Rothwell PM (2005) External validity of randomised controlled trials: to whom do the results of this trial apply? Lancet 365: 82–93