3 Approach to evidence generation

3.1 Evidence gaps and ongoing studies

Table 1 summarises the evidence gaps and ongoing studies that might address them. Information about evidence status is derived from the external assessment group's report; evidence not meeting the scope and inclusion criteria is not included. The table shows the evidence available to the committee when the guidance was published.

**Table 1 Evidence gaps and ongoing studies**
| Evidence gap | BoneView (Gleamer) | RBfracture (Radiobotics) | Rayvolve (AZmed) | TechCare Alert (Milvue) |
|---|---|---|---|---|
| Diagnostic accuracy | Evidence is available; ongoing study | Limited evidence available; ongoing study | Limited evidence available | Limited evidence available |
| Clinical and service outcomes | Limited evidence available; ongoing study | Limited evidence available; ongoing study | Limited evidence available | No evidence |
| Effectiveness in different subgroups | No evidence | No evidence | No evidence | No evidence |
| Costs associated with establishing the infrastructure needed to implement the AI technology | No evidence | No evidence | No evidence | No evidence |

3.2 Data sources

Most of the data, particularly that relating to diagnostic accuracy, is likely best collected through primary data collection. Some existing data sources may collect some of the necessary outcome information, but they will need to be linked to each other and to the primary data collection.

NICE's real-world evidence framework provides detailed guidance on assessing the suitability of a real-world data source to answer a specific research question. Potential data sources include:

The NHS picture archiving and communication system (PACS) will be a useful resource.

Local or regional data collections such as NHS England's sub-national secure data environments could potentially be used to collect information and link data sources together. Secure data environments are data storage and access platforms that bring together many sources of data, such as from primary and secondary care, to enable research and analysis. The sub-national secure data environments are designed to be agile and can be modified to suit the needs of new projects, as would be necessary in this instance.

The quality and coverage of real-world data collections are of key importance when they are used to generate evidence. Active monitoring and follow up through a central coordinating point is an effective and viable approach to ensuring good-quality data with broad coverage.

3.3 Evidence collection plan

The suggested approaches to addressing the evidence gaps are an experimental concordance study with existing imaging data and a real-world prospective study.

Centres should be chosen to best represent urgent care centres in the NHS and the variation between them (considering, for example, patient volume and number of readers), to address confounders and allow subgroup analyses. Sample populations should be representative, considering, for example, age, sex, ethnicity and socioeconomic status.

Concordance study to assess diagnostic accuracy

A concordance study is used to assess the agreement between 2 or more methods.
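Agreement between 2 readings of the same cases can be summarised in several ways. The following is a minimal sketch using Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The choice of kappa, the function name `cohens_kappa` and the example labels are assumptions for illustration only; this plan does not specify an agreement statistic.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two sets of labels given to the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Proportion of cases on which the two methods agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each method's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both methods used a single identical label for every case.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only (1 = fracture reported, 0 = no fracture reported).
ai_assisted = [1, 1, 0, 0, 1, 0, 1, 0]
ground_truth = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(ai_assisted, ground_truth)
print(kappa)
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance. The same function could be applied to any pair of the methods being compared.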

Each case should include clinical data available at the time of scanning in line with standard care for each of the methods being compared. This study will assess the concordance between the diagnosis reached for each included case by the:

healthcare professional assisted by AI technology (intervention)
healthcare professional unassisted by AI technology (comparator)
consultant radiologist or reporting radiographer interpretation and report (ground truth).

Prospectively collected anonymised image sets would be provided by emergency centres and processed to determine the diagnosis by the intervention and comparator and the ground truth. Ideally, cases should be randomly allocated to readers to minimise potential bias.
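Random allocation of cases to readers could be done in many ways. The sketch below shuffles the case identifiers with a seeded random number generator and deals them out in turn, so that each reader gets a similar number of cases and the allocation can be reproduced. The function name `allocate_cases`, the case identifiers and the reader labels are hypothetical, and a real study would follow the allocation method set out in its protocol.

```python
import random


def allocate_cases(case_ids, readers, seed=0):
    """Randomly allocate each case to exactly one reader, in balanced groups."""
    if not readers:
        raise ValueError("at least one reader is required")
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)

    allocation = {reader: [] for reader in readers}
    for index, case_id in enumerate(shuffled):
        # Deal the shuffled cases to readers in turn.
        allocation[readers[index % len(readers)]].append(case_id)
    return allocation


# Illustrative identifiers only.
cases = [f"case-{i:03d}" for i in range(10)]
readers = ["reader-A", "reader-B", "reader-C"]
allocation = allocate_cases(cases, readers, seed=42)
for reader, assigned in allocation.items():
    print(reader, assigned)
```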

Any cases that the technologies were unable to analyse should be recorded for further investigation. Discordant cases could be further explored to identify common characteristics and reasons for discordance.

Comparing AI-assisted (intervention) and unassisted (comparator) readings with the experienced consultant or radiographer report and review (reference standard) would allow assessment of diagnostic accuracy. It is possible that linked clinical outcomes could also provide evidence of whether a fracture was missed by the AI technology or human review when the patient returned at a later date.

As part of the data collection process, performance of the AI technology alone should be collected. Although this data is not directly relevant to how the AI technology would be deployed in the NHS, it enables separation of the software and human components of performance, and will allow monitoring, updating and direct comparison of future technologies. The combined performance may be sensitive to changes in the training level of users, or in unassisted diagnostic practices. Measurement of AI performance alone is a useful marker as a lower bound to identify drift from intended use due to automation bias.

Diagnostic accuracy should also be assessed in applicable subgroups, such as children and young people, and people with conditions that affect bone health. It is also important to consider readers with varying levels of experience.

Real-world prospective study and embedded qualitative study

To address the evidence gaps, a prospective real-world study is suggested. Ideally, this would compare outcomes in a period before implementation of the technology to a period after deployment.

This study could be done at a single centre or, ideally, replicated across multiple centres to show how the technology can be implemented across a range of services, representative of the variety in the NHS. Some outcomes may reflect other changes unrelated to the interventions that occur over time in the population. To control for these changes over time that might occur anyway, additional robustness can be achieved by collecting data in a centre that has not implemented the technology.
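One way to use data from a centre that has not implemented the technology is a difference-in-differences comparison, in which the before-and-after change in the comparison centre is subtracted from the change in the implementing centre. This is a minimal sketch of the arithmetic only. The choice of method, the function name and the recall rates are assumptions for illustration and are not taken from this plan; a real analysis would also need to adjust for confounders and quantify uncertainty.

```python
def difference_in_differences(pre_intervention, post_intervention,
                              pre_control, post_control):
    """Change in the implementing centre minus change in the comparison centre."""
    change_intervention = post_intervention - pre_intervention
    change_control = post_control - pre_control
    return change_intervention - change_control


# Hypothetical proportions of people recalled to hospital after radiology review.
effect = difference_in_differences(
    pre_intervention=0.10,   # implementing centre, before deployment
    post_intervention=0.06,  # implementing centre, after deployment
    pre_control=0.10,        # comparison centre, same earlier period
    post_control=0.09,       # comparison centre, same later period
)
print(round(effect, 3))
```

With these made-up figures, the recall rate fell by 4 percentage points in the implementing centre and by 1 percentage point in the comparison centre, so the estimate attributed to the technology is a fall of 3 percentage points.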

High-quality data on patient characteristics may be needed to identify and correct for any important differences between comparison groups and to assess who the technologies would not be suitable for. Important confounding factors should be identified with input from clinical experts during the protocol development.

Information to be collected in this study is detailed in section 3.4.

An embedded qualitative study is suggested to collect information on ease of use, and on the trust in and acceptability of the AI technology among clinicians and patients. Repeat, in-depth interviews could be held with staff before and after the implementation of the technologies in services. These could also examine aspects of learning how to use the AI technology. A longitudinal design could support understanding of how experiences may evolve over time and the processes involved in change. Patients' perspectives could also be captured through focus groups. Semi-structured interviews should be recorded and fully transcribed, and a thematic analysis approach employed.

Data collection should follow a predefined protocol and quality assurance processes should be put in place to ensure the integrity and consistency of data collection. See NICE's real-world evidence framework, which provides guidance on the planning, conduct, and reporting of real-world evidence studies.

3.4 Data to be collected

The following information has been identified for collection:

Diagnostic accuracy study

Diagnoses made by the AI-assisted healthcare professional, the unassisted healthcare professional, and the experienced reviewer. Also, ideally, diagnoses by AI technology alone.
Number and proportion of images not eligible for processing by the AI technology (for example, because of technical or software failure) and reasons given.
Performance of the different methods compared to the ground truth. Performance estimates should include overall accuracy, sensitivity, specificity, positive predictive value, negative predictive value and c-statistic. Number of true positives, false positives, true negatives and false negatives should also be reported.
Performance of the different methods among different subgroups such as age, sex, ethnicity, socioeconomic status and conditions that affect bone health.
Cases of diagnostic disagreement and the likely reason for disagreement.
Cases of missed fractures by the AI technology.
Time spent on review, with or without the AI technology.
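The performance estimates listed above can be derived from the counts of true and false positives and negatives against the ground truth. The following is a minimal sketch, assuming binary fracture or no-fracture labels for each case. The function name `diagnostic_metrics` and the example labels are hypothetical. The c-statistic is not covered because it needs a continuous score or confidence rating for each case rather than a binary label.

```python
def diagnostic_metrics(predicted, truth):
    """Confusion-matrix counts and accuracy measures for binary labels (1 = fracture)."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)

    def ratio(numerator, denominator):
        # None marks a measure that is undefined for these data.
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Illustrative labels only.
ai_assisted = [1, 1, 0, 0, 1, 0, 1, 0]
ground_truth = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = diagnostic_metrics(ai_assisted, ground_truth)
for name, value in metrics.items():
    print(name, value)
```

Running the same function on each subgroup, and on each method (AI-assisted, unassisted and AI alone), would give the subgroup and comparative estimates described in this list.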

Real-world prospective study

Clinical and service outcomes. These should be analysed considering factors such as type of fracture.
Clinical outcomes associated with missed diagnosis or misdiagnosis, for example, unnecessary treatments, further diagnostic procedures, or complications from misdiagnosis, ideally with quality-of-life impact.
The number and proportion of people being recalled to hospital after radiology review.
Incidence of further injury or harm during the time between the initial interpretation and treatment decision in urgent care and the definitive radiology report.
Total number of referrals to fracture clinics, and number and proportion of unnecessary referrals.
Rate of detection of non-fracture-related conditions by the AI technologies or failure to detect non-fracture-related conditions highlighted by the reporting healthcare professional.
Costs associated with establishing the infrastructure needed to implement the AI technologies.
Costs associated with maintaining the infrastructure needed for the AI technologies, including software, hardware and staff training.
Ongoing costs like system updates and technical support.

Information about the technologies

Information about how the technologies were developed, the update version tested, and how the effect of future updates will be monitored should also be reported. See the NICE evidence standards framework for digital health technologies.

3.5 Evidence generation period

This will be 2 years to allow for setting up and implementing the AI technologies, and for data collection, analysis and reporting.
