Appendix 1 – Data Suitability Assessment Tool (DataSAT)
See tools and resources for a downloadable DataSAT assessment template.
DataSAT assessment template
Research question
Add the research question here.
Item | Response |
---|---|
Data sources | For each contributing data source provide the name, version and date of data cut. Provide links to their websites, if available. |
Data linkage and data pooling | Report which datasets were linked, how these were linked, and performance characteristics of the linkage. Note whether linkage was done by a third party (such as NHS Digital). Clearly describe which data sources were pooled. |
Type of data source | Describe the types of data source (for example, electronic health record, registry, audit, survey). |
Purpose of data collection | Describe the main purpose of data collection (for example, clinical care, reimbursement, device safety, research study). |
Data collection | Describe the main types of data collected (for example, clinical diagnoses, prescriptions, procedures, patient experience data), how data was recorded (for example, clinical coding systems, free text, remote monitoring, survey response), and who collects the data (for example, healthcare professional, self-reported, digital health technology). If the nature of data collection has changed during the data period (for instance, a change in coding system or practices, or in data capture systems) describe the changes clearly. Describe any differences between data providers in how and what data were collected, and in data quality. If additional data collection was done for a research study, describe it, including how the validity and consistency of data collection were assured (for example, training). |
Care setting | State the setting of care for each dataset used (for example, primary care, secondary care, specialist health centres, social services, home use [for wearable devices, or self-reported data on apps or websites]). |
Geographical setting | State the geographical coverage of the data sources. |
Population coverage | State how much of the target population is represented by the dataset (for example, population representativeness or patient accrual). |
Time period of data | State the time period covered by the data. |
Data preparation | Provide details of whether raw data were accessed for analysis, or whether the data owner had undertaken any data preparation steps such as cleansing or transformation. Mention whether centralised transformation to a common data model was undertaken. Include links to any relevant information, including common data model type and version number and details of mapping. Full details of data preparation specific to addressing the research question are covered in the section on reporting on data curation. |
Data governance | Provide the details of the data controller and funding for each source. Describe the information governance processes for data access and use. |
Data specification | Note whether a data specification document is available. This may include a data model, data dictionary, or both. |
Data management plan and quality assurance methods | Note whether a data management plan or documentation of source quality assurance methods is available, with links to relevant documents. |
Other documents | Note whether any other documentation is available. Provide hyperlinks or citations to key publications, if available. If the dataset is available from the Health Data Research UK (HDRUK) innovation gateway, provide the hyperlink to its profile on the HDRUK website. |
Data quality
Study variable | Target concept | Operational definition | Quality dimension | How assessed | Assessment result |
---|---|---|---|---|---|
State the type of variable (for example, population eligibility, outcome) | Define the target concept (for example, myocardial infarction [MI]) | Give the operational definition. For example, MI defined by an ICD-10 code of I21 in the primary diagnosis position | Choose: accuracy or completeness | Describe how quality was assessed. Provide reference to previous validation studies if applicable. | Provide quantitative assessment of quality if available. For example, 'positive predictive value 85% (75% to 95%)' |
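To make the example in the table concrete, the operational definition 'MI defined by an ICD-10 code of I21 in the primary diagnosis position' could be applied to episode records roughly as follows. This is a minimal sketch only: the `episodes` structure, the episode identifiers and the function name `is_mi_primary` are assumptions made for illustration and are not part of DataSAT. It also assumes that the first code listed for an episode is the primary diagnosis.

```python
def is_mi_primary(diagnoses):
    """Return True if the first-listed (primary) diagnosis is an ICD-10 I21 code.

    `diagnoses` is an ordered list of ICD-10 codes for one episode,
    with or without the dot separator (for example "I21.0" or "I210").
    """
    if not diagnoses:
        return False
    primary = diagnoses[0].replace(".", "").upper()
    return primary.startswith("I21")


# Hypothetical episodes: identifier -> ordered diagnosis codes.
episodes = {
    "ep1": ["I21.0", "E11.9"],  # I21 in the primary position
    "ep2": ["J44.1", "I21.9"],  # I21 only in a secondary position
    "ep3": ["i219"],            # no dot, lower case
    "ep4": [],                  # no diagnoses recorded
}

flags = {episode_id: is_mi_primary(codes) for episode_id, codes in episodes.items()}
print(flags)
```

Writing the definition down at this level of detail (code prefix, diagnosis position, handling of empty records) is what allows its accuracy to be checked against a reference standard.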
Data relevance
Item | Response |
---|---|
Population | Describe the extent to which the analytical sample reflects the target population. This should consider any data exclusions (for example, because of missing data on key prognostic variables). |
Care setting | Describe how well the care settings reflect routine care in the NHS. |
Treatment pathway | Describe how well the treatment pathways experienced by people in the data reflect routine care pathways in the NHS (including any diagnostic tests). |
Availability of key study elements | Note how well the dataset met the requirements of the research question in terms of availability of the necessary data variables, including key population eligibility criteria, outcomes, intervention and covariates (including confounders and effect modifiers). |
Study period | State the extent to which the time period covered by the data provides information relevant to decisions. This should cover any important changes to care pathways (including tests) or background changes in outcome rates. |
Timing of measurements | Describe whether the timing of measurements meets the needs of the research question. |
Follow up | Note whether the follow-up period available in the dataset is sufficient for assessing the outcomes. |
Sample size | Provide the sample size of the target population in the dataset and demonstrate that it is adequate to generate robust results. |
DataSAT – case study
Please note that the reporting for this case study is based on publicly available information in Wing et al. 2021.
Research question
What is the effect of the long-acting beta-2 agonist and inhaled corticosteroid combination product fluticasone propionate plus salmeterol compared with no exposure or exposure to salmeterol only in people with chronic obstructive pulmonary disease (COPD)?
Item | Response |
---|---|
Data sources | Clinical Practice Research Datalink (CPRD) GOLD; Hospital Episode Statistics (HES) Admitted Patient Care data. |
Data linkage and data pooling | CPRD and HES are linked. Patients are identified by a centralised linkage algorithm run by NHS Digital. This is an 8-step deterministic linkage algorithm based on 4 identifiers: NHS number, sex, date of birth and postcode. Linkage to HES data is possible for 75% of enrolled patients. See information on linked data for CPRD. |
Type of data source | HES = administrative records; CPRD = electronic health records |
Purpose of data collection | HES is derived from the Secondary Uses Service (SUS) data, based on information submitted to NHS Digital by healthcare providers. Data collection is primarily intended to support the reimbursement of hospitals for the provision of services in England. CPRD collects anonymised patient data from a network of GP practices across the UK. This data is initially collected during a patient's time in primary care services. |
Data collection | CPRD = demographics, clinical diagnoses (Read v2 or SNOMED-CT), tests (medcode or SNOMED-CT), prescriptions (prodcode) including dose, route of administration and duration. CPRD GOLD collects fully coded patient electronic health records from GP practices using the Vision software system. Data are recorded by health and care staff working within the Vision software. HES = diagnoses (ICD-10), procedures (OPCS-4), admission, discharge, type of care, basic demographics. HES data are collected during a patient's time at hospital; they may be recorded during interactions with health and care staff in the hospital and are assembled by teams of clinical coders. |
Care setting | HES = secondary care; CPRD = primary care |
Geographical setting | HES = England; CPRD = a representative sample of UK general practices using Vision software. HES-linked CPRD data is available for England only. |
Population coverage | CPRD GOLD has data for about 3 million currently registered people (around 4.74% of the UK population); see CPRD data highlights. HES data covers all NHS Clinical Commissioning Groups in England. |
Time period of data | The CPRD-linked HES dataset covers January 2000 to January 2017. |
Data preparation | No details available for CPRD. However, general practices are included only after demonstrating their records are of research quality. HES applies centralised processing rules before the data are released for research. These rules are in place to improve the value and quality of the data: they validate the data within certain fields, derive additional fields and values, and remove records that are invalid or out of scope for the HES data set. |
Data governance | CPRD is a centre of the MHRA, which is an executive agency of the Department of Health & Social Care (DHSC). DHSC is therefore the data controller for CPRD data. HES data is controlled by the Health and Social Care Information Centre (also known as NHS Digital). CPRD has received funding from the MHRA, Wellcome Trust, Medical Research Council, NIHR Health Technology Assessment programme, Innovative Medicines Initiative, UK Department of Health, Technology Strategy Board, Seventh Framework Programme EU, and various universities, contract research organisations and pharmaceutical companies. HES data collection is mandated and funded by the UK Government. |
Data specification | Fields in HES are derived from the NHS data model and the NHS data dictionary. |
Data management plan and quality assurance methods | HES undertakes processing and data quality checks (see The processing cycle and HES data quality). No data quality assurance information was identified for CPRD GOLD. However, records from individual general practices are assessed and only included in CPRD after being deemed of research quality. |
Other documents | None. |
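The response on data linkage above describes a stepwise deterministic algorithm based on NHS number, sex, date of birth and postcode. The general idea of such an algorithm can be sketched as below. This is an illustration under stated assumptions, not the NHS Digital algorithm: the three match rules in `MATCH_STEPS`, the field names, the function `link_record` and all example values are hypothetical, and the actual 8 steps are not reproduced here.

```python
from typing import Optional

# Hypothetical match rules, tried in order from strictest to loosest.
# These are NOT the actual 8 steps used for CPRD-HES linkage.
MATCH_STEPS = [
    ("nhs_number", "sex", "dob", "postcode"),
    ("nhs_number", "sex", "dob"),
    ("sex", "dob", "postcode"),
]


def link_record(record: dict, candidates: list) -> Optional[tuple]:
    """Return (step number, matched candidate) for the first rule that
    yields exactly one candidate, or None if no rule gives a unique match."""
    for step, keys in enumerate(MATCH_STEPS, start=1):
        # A rule cannot be applied if the record lacks one of its identifiers.
        if any(record.get(key) is None for key in keys):
            continue
        matches = [
            c for c in candidates
            if all(c.get(key) == record[key] for key in keys)
        ]
        if len(matches) == 1:
            return step, matches[0]
    return None


# Made-up records; the identifiers are placeholders, not real NHS numbers.
gp_record = {"nhs_number": "0000000001", "sex": "F",
             "dob": "1950-02-01", "postcode": None}
hospital_records = [
    {"id": "H1", "nhs_number": "0000000001", "sex": "F",
     "dob": "1950-02-01", "postcode": "ZZ1 1ZZ"},
    {"id": "H2", "nhs_number": "0000000002", "sex": "M",
     "dob": "1948-07-30", "postcode": "ZZ1 1ZZ"},
]

result = link_record(gp_record, hospital_records)
print(result)
```

Recording the step at which each match was made is one way to report the performance characteristics of the linkage, because matches made at looser steps are more likely to be false.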
Data quality
Study variable | Target concept | Operational definition | Quality dimension | How assessed | Assessment result |
---|---|---|---|---|---|
Population | COPD | CPRD diagnostic (Read v2) codes for COPD (see codelist in supplementary material of Quint et al. 2014) | Accuracy | Previously published validation study comparing algorithms for identifying people with COPD with physician review questionnaire as gold standard (Quint et al. 2014) | Positive predictive value (PPV): 87% (95% confidence interval [CI] 78% to 92%) |
Population | Disease severity | Global Initiative for Chronic Obstructive Lung Disease (GOLD) stage derived from spirometry measurements (see codelist) | Completeness | Proportion of patients with missing spirometry data | 20% |
Intervention | Fluticasone propionate + salmeterol | CPRD prescribing record matching definition of drug treatment determined by codelist | Accuracy | CPRD prescribing data is expected to be highly accurate | n/a |
Outcome | COPD exacerbation | Any of the following: (1) a CPRD diagnostic (Read) code for lower respiratory tract infection or acute exacerbation of COPD (AECOPD); (2) a prescription of a COPD-specific antibiotic combined with oral corticosteroid (OCS) for 5 to 14 days; (3) a record (Read code) of 2 or more respiratory symptoms of AECOPD with a prescription of COPD-specific antibiotics and/or OCS on the same day. See codelist | Accuracy | Previously published validation study comparing algorithms for identifying people with COPD exacerbations with physician review questionnaire as gold standard (Rothnie et al. 2016) | PPV: 86% (95% CI 83% to 88%); sensitivity: 63% (95% CI 55% to 70%) |
Outcome | All-cause mortality | Record in Office for National Statistics (ONS) mortality statistics (centrally linked to CPRD data) | Accuracy | ONS mortality records are the gold standard data for deaths | n/a |
Covariate (confounder) | Alcohol intake | Reported directly in CPRD (closest to index date) | Completeness | Proportion of patients with missing data on alcohol intake | 30% |
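The accuracy and completeness results reported above are simple proportions. As an illustration only, the sketch below computes a positive predictive value with a 95% Wilson score interval, and a proportion of missing values. The counts and values are invented for the example and are not taken from Quint et al. 2014 or Rothnie et al. 2016, which may have used a different interval method. The function names are likewise assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def proportion_missing(values):
    """Share of entries recorded as None (completeness is 1 minus this value)."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v is None for v in values) / len(values)


# Hypothetical validation sample: 100 patients flagged by a codelist,
# 87 of them confirmed by the reference standard.
confirmed, flagged = 87, 100
ppv = confirmed / flagged
ci_low, ci_high = wilson_interval(confirmed, flagged)
print(f"PPV {ppv:.0%} (95% CI {ci_low:.0%} to {ci_high:.0%})")

# Hypothetical alcohol-intake field for five patients.
alcohol_units = [12, None, 0, None, 4]
print(f"Missing: {proportion_missing(alcohol_units):.0%}")
```

Reporting the interval alongside the point estimate, as the table does, shows how much the validation sample size limits what can be said about accuracy.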
Data relevance
Item | Response |
---|---|
Population | Patients in CPRD have similar demographic characteristics to the wider UK population. Results from CPRD are generally expected to generalise to the wider eligible population. Complete records analysis was done, excluding records with missing data on socioeconomic status, alcohol consumption and BMI. All these variables had less than 5% of the data missing. Around one-fifth of patients were excluded because they did not have spirometry measurements recorded in the CPRD. Those without measurements tend to have less contact with health services, which could affect the generalisability of results. |
Care setting | Appropriate. COPD drugs are typically administered in primary care (CPRD) while relevant events may be observed in primary or secondary care (CPRD or HES). |
Treatment pathway | The data represents routine practice in the NHS. |
Availability of key study elements | Sufficient data on exposures and outcomes are available. Although only prescribing and not dispensing data is available from CPRD, this is expected to be a good proxy for dispensing. No information was available on negative reversibility spirometry results, which may be a key confounder. Dosage information is limited in CPRD. |
Study period | There have been no major changes to UK clinical practice for the management of COPD since the study period. |
Timing of measurements | The longitudinal nature of the analysis allows for the research question to be answered. The date of entry is expected to reflect the actual timing of clinical events well. |
Follow up | The average follow up of 2 years is sufficient for the primary outcome of COPD exacerbations to have occurred. |
Sample size | The required sample size for COPD exacerbations was estimated to be 600 per arm at 80% power and 5% significance (see Wing et al. 2021 for details). The actual sample size of about 2,500 per arm far exceeds this. |
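For readers unfamiliar with how a required sample size follows from power and significance level, the sketch below uses the standard normal-approximation formula for comparing two proportions. It is an assumption-laden illustration, not the calculation in Wing et al. 2021: the outcome proportions (0.40 and 0.32) are invented, the function name `n_per_arm` is hypothetical, and the published calculation may have used a different outcome measure and method.

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a difference between two
    proportions with a two-sided test, using the normal approximation."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type II error
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Invented outcome proportions in the comparator and treated arms.
required = n_per_arm(0.40, 0.32)
available = 2500  # approximate per-arm size reported in the case study
print(f"Required per arm: {required}; available per arm: {available}")
print("Adequate" if available >= required else "Inadequate")
```

A smaller expected difference between arms, or a higher target power, increases the required sample size, so the assumed effect size should be stated alongside the result.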