Assessing data suitability
Key messages
- Transparent reporting of data sources is essential to ensure trust in the data source and understand its fitness for purpose to address the research question
- Data should be of good and known provenance
- Reporting on data sources should cover the characteristics of the data, data collection, coverage and governance
- Data fitness for purpose can be summarised by the data quality and relevance:
  - data quality relates to the completeness and accuracy of key study variables
  - data relevance is determined by the data content, differences in patients, interventions and care settings between the data and the target population in the NHS, and characteristics of the data such as sample size and length of follow up.
- The Data Suitability Assessment Tool (DataSAT) in Appendix 1 may be used to provide consistent and structured information on data suitability.
- There are reasonable trade-offs between different data sources in terms of quality, size, clinical detail and locality.
- The acceptability of a given data source may depend on the application and various contextual factors.
Introduction
Data used to inform NICE guidance should be reported transparently and be of good provenance and fit for purpose in relation to the research question. The primary aims of this section of the framework are to:
- provide clear guidance to evidence developers about expectations for clear and transparent reporting on data and its fitness for purpose
- enable evidence reviewers and committees to understand data trustworthiness and suitability when critically appraising the study or developing recommendations.
This section should be read alongside the section on conduct of quantitative real-world evidence studies.
We do not define minimum standards for data suitability beyond that the data should be used in accordance with national laws and regulations concerning data protection and information governance (see the section on reporting on data sources). The considerations for data suitability are broadly applicable across different types of real-world data and use cases but are largely focused on quantitative studies.
The acceptability of a data source will depend on the use case, and contextual factors (see the section on considerations for the quality and acceptability of real-world evidence studies). We recognise the need for trade-offs between different characteristics of data sources including quality, size, clinical detail and locality. International data may be appropriate for some questions in the absence of sufficient national data or when results are expected to translate well between settings. We also recognise that there may be challenges in identifying or collecting the highest quality evidence in some applications including in rare diseases and for some medical devices and interventional procedures (see the section on challenges in generating real-world evidence).
We do not request a particular format for the overall presentation of this information. However, we have developed the Data Suitability Assessment Tool (DataSAT) to help the consistent and structured presentation of data suitability at the point of assessment. The concepts presented in the tool may also help developers choose between potential data sources and in performing feasibility studies, but this is not its primary purpose. The tool template and example applications are presented in appendix 1.
Data provenance
A full understanding of data provenance is essential to create trust in the use of data and understand its fitness for purpose for a given application. In this section we present data provenance considerations across 4 themes: basic characteristics of the data source, data collection, coverage and governance.
Many real-world evidence studies will combine more than 1 data source, either by data linkage or data pooling. Data linkage is often done to extend the information available on individual patients, for example, by combining data from a prospective observational cohort study with hospital discharge or mortality records, or patient-generated health data. Data pooling is used to extend sample size or coverage of data and is common in studies of rare diseases.
The reporting of data sources should primarily refer to the combined data used for the research study. However, important differences between contributing datasets should be clearly described.
Basic characteristics of data
Information that allows identification of the data sources should be clearly reported. This includes the names of the overall and contributing data sources, versions (if available) and the dates of data extraction.
Common data models are used to standardise the structure and sometimes coding systems of different data sources. If data has been converted to a common data model, the model and its version should be reported and full details of the mapping made available, including any information loss. This information is essential to allow the study to be reproduced.
Common data models can also support the use of federated data networks. These allow individual patient health data to stay under the protection of partnering data holders who will run standardised analyses before results are aggregated across datasets. Reporting of federated data networks should be sufficient to understand the process of recruiting data partners, feasibility assessments, and the common analytical framework used.
While complete and accurate data linkage will improve the quality and value of data, imperfect linkage could exclude patient records or lead to data misclassification. Therefore, when multiple sources of data are linked, the following information should be reported:
- who did the linkage (for example, NHS Digital)
- methods of linkage, including whether deterministic or probabilistic, and the variables used for linkage (the distinction is illustrated in the sketch after this list)
- the performance characteristics of data linkage (see the Government Analysis Function guidance on quality assessment in data linkage).
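The sketch below illustrates the difference between deterministic and probabilistic linkage. It is purely illustrative: the field names, match weights and threshold are hypothetical, and real linkage (for example, using the Fellegi-Sunter framework) derives weights from estimated match and non-match probabilities rather than fixing them by hand.

```python
# Illustrative only: hypothetical identifiers and hand-picked weights.

def deterministic_link(rec_a, rec_b):
    """Link records only if all key identifiers agree exactly."""
    keys = ("nhs_number", "date_of_birth", "postcode")
    return all(rec_a.get(k) == rec_b.get(k) for k in keys)

def probabilistic_link(rec_a, rec_b, threshold=4.0):
    """Sum agreement weights across identifiers and link if the total score
    reaches the threshold, so records can link despite minor discrepancies."""
    weights = {"nhs_number": 5.0, "date_of_birth": 2.0, "postcode": 1.0, "surname": 1.5}
    score = sum(w for field, w in weights.items() if rec_a.get(field) == rec_b.get(field))
    return score >= threshold, score

# Example: a single-digit transcription error in the NHS number means the
# deterministic rule fails, but the probabilistic rule may still link the records.
a = {"nhs_number": "9434765919", "date_of_birth": "1954-03-02", "postcode": "M1 1AE", "surname": "SMITH"}
b = {"nhs_number": "9434765918", "date_of_birth": "1954-03-02", "postcode": "M1 1AE", "surname": "SMITH"}

print(deterministic_link(a, b))   # False
print(probabilistic_link(a, b))   # (True, 4.5) with these illustrative weights
```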
Data collection
An understanding of a data source requires knowledge of the purpose and methods of data collection.
Information on the original purpose of data collection should include:
- whether the data was routinely collected or collected for a specific research purpose (or a combination)
- the type of data source and primary use, for example:
  - electronic health records for patient care
  - administrative data for reimbursement of providers
  - registry for assessing medical device safety
  - prospective observational cohort study to estimate quality of life after an intervention
  - retrospective chart review to model the natural history of a condition.

Additional information on important data types should cover:
- which types of data were collected, for example, clinical diagnoses, tests, procedures and prescriptions
- how these were coded or recorded, for example, using ICD-10 codes for clinical diagnoses, or free text data on cancer stage or biomarkers
- how data was collected, for example, directly by healthcare professionals in clinical examinations, by remote monitoring or by administrative staff. If data is captured by a digital health technology, the validity of the technology should be reported
- changes to data collection over time, for example:
  - addition of new data elements (for example, a quality-of-life questionnaire)
  - removal of data elements
  - changes to the method of data collection (for instance, a switch to routine monitoring of patient outcomes)
  - changes to coding systems (for example, the switch from Read v2 to SNOMED-CT codes in UK primary care). Information on any mapping between coding systems should be made available
  - software updates to data capture systems including digital health technologies that had substantial impacts on data capture.
- quality assurance processes for data collection that were in place (including training or blinded review)
- transformations performed on the data such as conversion to a common data model or other data standards.
Any differences between data providers in what data was collected, how it was collected and its quality should be described. This is especially important when data sources are pooled from different systems and across countries.
Data coverage
Providing clear information on data coverage is essential, including the population, care settings, geography and time. Such information has important implications for data relevance that can inform later assessments of data suitability.
Information should be provided on:
- the extent to which the data source captures the target population:
  - if a data source does not include the full target population, the representativeness of the data captured should be noted
  - for studies involving prospective data collection including patient registries, information on patient accrual should be reported.
- the settings in which data collection was based:
  - this should distinguish between care settings (for example, primary care compared with secondary care), type of providers (for example, specialist medical centres compared with general hospitals) and other factors when relevant
  - if information was collected outside of the health or social care system, this should be described (for instance, remote monitoring of activities of daily living).
- the geographical coverage of the data including countries and regions, if relevant
- the time period of data collection.
Data governance
Information about data governance is important for understanding the maturity of data and its reliability. This should include the following information:
- the name of the data controller
- the funding source for data collection and maintenance
- data documentation including items such as a data dictionary and data model
- details of the quality assurance and data management process including audit.
Data fitness for purpose
The section on data provenance described important characteristics of data sources that are independent of any planned study. In this section we focus on the fitness for purpose of data to answer specific research questions, considering its quality and relevance. A dataset may be of value for 1 application but not another.
Substantial data curation including data cleaning, exclusions and transformations is needed to prepare original data sources for analysis. Data curation and quality assurance should be reported transparently as described in the section on study reporting.
Data quality
Limitations to data quality include missing data, measurement error, misclassification and incorrect reporting of dates. These issues can apply to all study variables including patient eligibility criteria, outcomes, interventions or exposures, and covariates. They can create information biases that cause real-world evidence studies to produce biased estimates. Transparent reporting of data quality is essential for reviewers to understand the risk of bias and whether it has been adequately addressed through data analysis or explored through sensitivity analysis. We focus on 2 main aspects of data quality: completeness and accuracy.
Information on completeness and accuracy should be provided for all key study variables. Study variables can be constructed by combining multiple data elements, including both structured data and unstructured data, and may come from different linked data sources. The complexity of these study variables will vary according to the data sources and applications. For instance, in some applications an asthma exacerbation may be identified from a single data field (such as the response to a questionnaire), while in others it may need to be constructed from combinations of diagnostic codes, prescriptions, tests, free text or other data.
As described in the section on study reporting, it is essential that clear and unambiguous definitions are given for each study variable including types of data, code lists, extraction from unstructured data, and time periods, when possible. These operational definitions including code lists should be made available to others and reused, if appropriate. The validity of an existing code list should be reviewed before use. When unstructured data is used, information should be provided on data extraction and the reliability of these methods.
These considerations also apply to data from digital health technologies producing patient-generated data, including patient-reported outcomes and digital biomarkers. Further information on the validity of data generated from the technology and user accessibility should be provided.
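The sketch below shows, purely for illustration, how an operational definition of the asthma exacerbation example above might be constructed from code lists. The diagnosis codes, product code and time window are hypothetical; a real study would use validated, published code lists and report them in full.

```python
from datetime import date

# Hypothetical code lists, for illustration only: real studies would use
# validated code lists (for example, SNOMED CT or ICD-10) and make them available.
EXACERBATION_DIAGNOSIS_CODES = {"J45.901", "J45.902"}
RESCUE_STEROID_PRODUCT_CODES = {"PREDNISOLONE_5MG_TAB"}

def has_exacerbation(diagnoses, prescriptions, window_days=14):
    """Flag an asthma exacerbation when a qualifying diagnosis and a rescue oral
    steroid prescription occur within window_days of each other.
    diagnoses and prescriptions are lists of (code, date) tuples."""
    for dx_code, dx_date in diagnoses:
        if dx_code not in EXACERBATION_DIAGNOSIS_CODES:
            continue
        for rx_code, rx_date in prescriptions:
            if rx_code in RESCUE_STEROID_PRODUCT_CODES and abs((rx_date - dx_date).days) <= window_days:
                return True
    return False

# Example patient record
diagnoses = [("J45.901", date(2023, 5, 10))]
prescriptions = [("PREDNISOLONE_5MG_TAB", date(2023, 5, 12))]
print(has_exacerbation(diagnoses, prescriptions))  # True
```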
To interpret study results, further information is needed on reasons for data missingness and inaccuracy and whether these are random or systematic. For comparative studies, it is important to understand the extent to which missingness or inaccuracy differ across intervention groups. The section on addressing information bias has further information on methods for dealing with missing data, measurement error and misclassification. We have not set minimum thresholds for data completeness or accuracy because the acceptable levels will depend on the application (see the section on considerations for the quality and acceptability of real-world evidence studies).
Completeness
Data completeness refers to the percentage of records without missing data at a given time point. It does not provide information on the accuracy of that data. The percentage is often easily calculated from the data source and should be calculated before excluding relevant data or performing imputation. For outcomes such as experiencing a myocardial infarction, issues of data missingness should be clearly distinguished from misclassification. For binary variables, the absence of a recorded event (when it has in fact occurred) may be best summarised as a data accuracy issue (misclassification due to false negatives).
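As a minimal illustration (hypothetical variables and values), completeness per study variable can be summarised directly from the extracted data, before any exclusions or imputation:

```python
import pandas as pd

# Illustrative raw extract: completeness is assessed on the data as extracted.
extract = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "smoking_status": ["never", None, "current", None, "former"],
    "baseline_bmi": [27.4, 31.2, None, 24.8, 29.0],
})

# Percentage of records with a non-missing value for each variable
completeness = extract.drop(columns="patient_id").notna().mean() * 100
print(completeness.round(1))  # smoking_status 60.0, baseline_bmi 80.0
```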
Accuracy
Measuring accuracy, or how closely the data resembles reality, depends on the type of variable. Common metrics of accuracy for different types of variables are listed below (a brief worked sketch follows the list):
- continuous or count variables (mean error, mean absolute error, mean squared error)
- categorical variables (diagnostic accuracy measures such as sensitivity, specificity, positive predictive value, and negative predictive value; Fox et al. 2022)
- time-to-event variables (difference between the actual time of the event and the recorded time of the event).
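The sketch below works through these metrics on small made-up validation samples, in which recorded values are compared with gold standard values. It is illustrative only; the figures carry no meaning beyond showing the calculations.

```python
from datetime import date

# Continuous variable (for example, recorded weight compared with measured weight in kg)
recorded = [70.0, 82.5, 91.0, 65.5]
gold     = [71.0, 82.0, 95.0, 65.5]
errors = [r - g for r, g in zip(recorded, gold)]
mean_error = sum(errors) / len(errors)
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)
mean_squared_error = sum(e * e for e in errors) / len(errors)

# Categorical variable: counts from a 2x2 table of recorded status against true
# status (for example, a diagnosis flag validated by medical record review)
tp, fp, fn, tn = 45, 5, 10, 940
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
positive_predictive_value = tp / (tp + fp)
negative_predictive_value = tn / (tn + fn)

# Time-to-event variable: days between recorded and actual event dates
recorded_dates = [date(2022, 3, 4), date(2022, 6, 1)]
actual_dates   = [date(2022, 3, 1), date(2022, 6, 1)]
date_differences = [(r - a).days for r, a in zip(recorded_dates, actual_dates)]

print(round(mean_error, 3), round(mean_absolute_error, 3), round(mean_squared_error, 3))
print(round(sensitivity, 3), round(specificity, 3),
      round(positive_predictive_value, 3), round(negative_predictive_value, 3))
print(date_differences)
```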
Gold standard approaches for measuring accuracy of the data include:
- comparison with an established gold standard source (for example, UK Office for National Statistics mortality records)
- medical record review.
These approaches may be taken for a subset of the analytical population or be based on a previous study in the same or similar population and data source.
These gold standard approaches are not always possible or feasible. Other approaches that can show approximate accuracy include (the first 2 are illustrated in the sketch after this list):
- comparing different variable definitions, for example, by using additional codes, requiring multiple codes, or combining different data types
- comparing sample distributions with population distributions or previous studies
- exploring the plausibility of the data, informed by expert opinion
- checking consistency (agreement in patient status in records across the data sources)
- assessing conformance (whether the recording of data elements is consistent with the data source specifications)
- checking persistence (whether the data are consistent over time).
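As a minimal sketch with hypothetical data, the first 2 approaches could look like this: checking agreement between a narrow and a broader outcome definition, and comparing the sample age distribution with an external reference distribution.

```python
import pandas as pd

# Illustrative cohort with 2 alternative outcome definitions: a narrow definition
# (single diagnostic code) and a broader one (diagnostic code or related prescription).
cohort = pd.DataFrame({
    "outcome_narrow": [1, 0, 0, 1, 0, 0, 1, 0],
    "outcome_broad":  [1, 1, 0, 1, 0, 0, 1, 0],
    "age_band": ["40-49", "50-59", "60-69", "50-59", "40-49", "60-69", "70-79", "50-59"],
})

# Low agreement between definitions would prompt further investigation of accuracy
agreement = (cohort["outcome_narrow"] == cohort["outcome_broad"]).mean()
print(f"Agreement between definitions: {agreement:.0%}")

# Compare the sample age distribution with hypothetical external reference figures
sample_dist = cohort["age_band"].value_counts(normalize=True).sort_index()
reference_dist = pd.Series({"40-49": 0.30, "50-59": 0.30, "60-69": 0.25, "70-79": 0.15})
print(pd.DataFrame({"sample": sample_dist, "reference": reference_dist}).round(2))
```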
Transparent reporting of data accuracy for key study variables includes:
- quantitative information on accuracy, if available, including means and confidence intervals (additional distributional information may also be valuable)
- a description of the methods and processes used to quantify accuracy, including any assumptions made. When this is based on previous studies, the applicability to the present analysis should be discussed, considering differences in study variable definitions, populations, data sources, time periods or other relevant factors.
Data relevance
The second component of data fitness for purpose is data relevance. Key questions of data relevance are whether:
- the data provides sufficient information to produce robust and relevant results
- the results are likely to generalise to patients in the NHS.
The assessment of data relevance should be informed by the information provided in the section on data provenance.
NICE prefers data relating directly to the UK population that reflects current care in the NHS. However, we recognise the potential value of international data if limited information is available for the NHS or if results can be expected to translate well between settings. In some applications there will be a trade-off between using local data and other important characteristics of data including quality, recency, clinical detail, sample size and follow up. International data is likely to be of particular value when an intervention has been available in another country before becoming available in the UK, or in the context of rare diseases. Similar considerations apply to using data from regional or specialist healthcare providers within the NHS.
We describe key aspects of data relevance below, distinguishing between data content, coverage and characteristics.
Data content
There are 3 key considerations for understanding whether the data content is sufficient for a research question:
- Does the data source contain sufficient data elements to enable appropriate definitions of population eligibility criteria, outcomes, interventions and covariates, as relevant?
- Are the data elements collected with sufficient granularity (or detail)?
- Are measurements taken at relevant time points?
To help understand whether data elements are sufficient, it is useful to first define the target concept and judge the extent to which this can be proxied using real-world data. The implications of insufficient data will vary depending on the study variable and use case. Key endpoints necessary to answer the research questions should be available and should be sufficiently objective and detailed to support an evaluation. Insufficient information to define the population, interventions or outcomes appropriately will limit the relevance of the research findings. Insufficient information on confounders will limit the ability to produce valid findings.
The needed granularity of data will vary across research questions. For example, when considering the effect of knee replacement on quality of life we may be interested in the effect compared with physiotherapy alone, total versus partial knee replacement, or of different implanted devices. Similarly, any stroke may be appropriate as an outcome for some research questions, while others will need haemorrhagic and ischaemic strokes to be separated.
Finally, we may be interested in the effect of knee replacement on quality of life at 1 year after the procedure. In routinely collected data, the recording of such information does not follow a strict protocol, so measurements may be missing or taken at irregular time points.
Data coverage
The generalisability of research findings to patients in the NHS will depend on several factors, including:
- the similarity in patient characteristics between the analytical sample and target population
- the similarity in care pathways and treatment settings
- changes in care pathways (including diagnostic tests) and outcomes over time.
The similarity of the analytical sample to the target population is especially important in descriptive studies, such as those estimating disease prevalence. In comparative studies this may be less important if the intervention effects are expected to transfer across patients with different characteristics, and the emphasis should be on ensuring internal validity. If there is substantial heterogeneity in treatment effects across subgroups, similarity in patient characteristics becomes more important. Effect estimates on the relative scale usually transfer better across subgroups than estimates on absolute scales (Roberts and Prieto-Merino 2014). In other applications, such as prognostic modelling, non-representative sampling may be preferred to ensure adequate representation of important patient subgroups.
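A simple hypothetical calculation illustrates why relative effects often transfer across subgroups better than absolute effects: applying the same relative risk to subgroups with different baseline risks produces different absolute risk reductions.

```python
# Hypothetical figures for illustration only
relative_risk = 0.80
for baseline_risk in (0.10, 0.30):
    treated_risk = baseline_risk * relative_risk
    absolute_risk_reduction = baseline_risk - treated_risk
    print(f"baseline risk {baseline_risk:.0%}: treated risk {treated_risk:.0%}, "
          f"absolute risk reduction {absolute_risk_reduction:.1%}")
# baseline risk 10%: absolute risk reduction 2.0%
# baseline risk 30%: absolute risk reduction 6.0%
```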
Consideration needs to be given to how any differences in the treatment pathways or care settings between the analytical sample and the NHS may affect the relevance of results. This is especially important when using international data. Even within the NHS, the data may relate to specific regions that are not representative of the country, or focus on specialist providers rather than all providers. Finally, changes to care pathways (including diagnostic tests), as well as background trends in outcomes (such as mortality), may limit the value of historical data, even from the NHS. These issues need to be carefully considered and reported when discussing the relevance of data for use in NICE guidance.
Data characteristics
The final category of data relevance concerns the size of the analytical sample and the length (and distribution) of follow up. The sample size should be large enough to produce robust estimates. However, we recognise that sample size will always be limited in some contexts. The follow up should be long enough for the outcomes of interest to have occurred or accrued (for outcomes such as healthcare costs). The amount of data available before the start of follow up may also be important to provide information on confounders and identify new users of an intervention. Using data sources with a shorter lag between data collection and availability for research may allow longer follow up to be available for analyses.