4 Approach to evidence generation
An approach to addressing the evidence gaps through real-world data collection is considered, and its strengths and weaknesses are highlighted.
Most of the technologies do not have ongoing studies that will address the evidence gaps, although Lunit INSIGHT CXR has ongoing research that may address some of them. For these technologies, additional evidence generation is therefore necessary.
qXR has ongoing research that may address all the essential and important evidence gaps and may not need additional evidence generation.
4.1 Evidence generation plan
For technologies lacking information on diagnostic accuracy and technical failure rates, diagnostic accuracy studies should be done to provide this evidence.
Other evidence gaps can be addressed through a real-world historical control study alongside a qualitative survey.
Diagnostic accuracy study
This could be done as a diagnostic cross-sectional study. The study would assess agreement between a clinical reviewer alone and a clinical reviewer aided by the software in identifying abnormal X-rays (those needing CT follow-up). It would be possible to report diagnostic accuracy (including sensitivity, specificity, negative predictive value and positive predictive value), variation across reviewers, and technical failure rates.
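As an illustration only, the following Python sketch shows how these accuracy measures and inter-reviewer agreement could be derived. The counts, reviewer calls and variable names are hypothetical, and the CT follow-up result is assumed to act as the reference standard; this is not part of the specified study design.

    # Minimal sketch (hypothetical counts) of deriving diagnostic accuracy
    # metrics for software-aided review against a CT reference standard.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 2x2 counts: calls on the X-ray versus CT reference standard
    tp, fp = 42, 8    # flagged abnormal: confirmed / not confirmed on CT
    fn, tn = 5, 145   # called normal: abnormal on CT / normal on CT

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, "
          f"PPV {ppv:.2f}, NPV {npv:.2f}")

    # Hypothetical per-image calls (1 = abnormal) from two reviewers, used to
    # illustrate variation across reviewers as Cohen's kappa.
    reviewer_a = [1, 0, 1, 1, 0, 0, 1, 0]
    reviewer_b = [1, 0, 0, 1, 0, 1, 1, 0]
    print("Inter-reviewer kappa:", cohen_kappa_score(reviewer_a, reviewer_b))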
Real-world historical control study
A historical control study could compare outcomes before and after the implementation of artificial intelligence (AI) software. This could assess the number and proportion of chest X-rays referred to CT scan, the time from chest X-ray to completion of the report, the number of chest X-rays assessed per reviewer per day, and the time from receipt of chest X-ray to CT scan report. The grade of NHS staff reviewing and reporting should also be collected.
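One way this before-and-after comparison could be summarised is sketched below. The file name and column names are assumptions for illustration, and a Mann-Whitney U test is used as one reasonable choice for comparing turnaround times, which are typically skewed.

    # Minimal sketch, assuming a CSV of chest X-ray episodes with hypothetical
    # columns: 'period' ('pre' or 'post' AI implementation), 'hours_to_report'
    # (time from chest X-ray to completed report) and 'referred_to_ct' (0/1).
    import pandas as pd
    from scipy.stats import mannwhitneyu

    df = pd.read_csv("cxr_episodes.csv")  # hypothetical file name
    pre = df[df["period"] == "pre"]
    post = df[df["period"] == "post"]

    # Proportion of chest X-rays referred to CT before and after implementation
    print("Referred to CT (pre): ", pre["referred_to_ct"].mean())
    print("Referred to CT (post):", post["referred_to_ct"].mean())

    # Turnaround times are usually skewed, so compare them with a
    # non-parametric test rather than a t-test.
    stat, p = mannwhitneyu(pre["hours_to_report"], post["hours_to_report"])
    print("Median hours to report (pre/post):",
          pre["hours_to_report"].median(), post["hours_to_report"].median())
    print("Mann-Whitney U p-value:", p)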
This study could also collect additional diagnostic outcomes comparing AI-assisted review with review by the reviewer alone. The study should assess whether abnormal findings on an X-ray correspond to disease-related abnormal findings on a follow-up CT scan (the reference standard). This would measure the positive predictive value aspect of diagnostic accuracy. Technical failure rates should also be reported. Information on the number of cancers detected and the stage of cancer at detection could also be collected.
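A minimal sketch of estimating this positive predictive value with a confidence interval is shown below. The counts are hypothetical, and a Wilson interval is used as one reasonable option rather than a required method.

    # Minimal sketch (hypothetical counts): PPV of an 'abnormal' X-ray call
    # against the CT reference standard, with a Wilson 95% confidence interval.
    from statsmodels.stats.proportion import proportion_confint

    n_flagged_abnormal = 180   # X-rays flagged abnormal and followed up by CT
    n_confirmed_on_ct = 150    # of those, disease-related abnormality on CT

    ppv = n_confirmed_on_ct / n_flagged_abnormal
    low, high = proportion_confint(n_confirmed_on_ct, n_flagged_abnormal,
                                   alpha=0.05, method="wilson")
    print(f"PPV {ppv:.2f} (95% CI {low:.2f} to {high:.2f})")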
The study could also collect information on missed cancers among people who were not referred for chest CT during the study period, although this would give a biased estimate of false negatives because not all missed cancers may be picked up over the observation period.
Data collection for each technology could be at a single centre or ideally across multiple centres. The study should also collect data on implementation costs for these technologies in routine clinical practice.
Qualitative survey
A qualitative survey is suggested to collect information on the ease of use and acceptability of the software for clinicians. The survey should include open-ended questions to give people the freedom to provide detailed insight. A range of views and perspectives should be collected that is representative of the clinical reviewers taking part at the sites where the technology is implemented.
4.2 Real-world data collections
The NHS England Secure Data Environment (SDE) service could potentially support evidence generation. This platform provides access to high-quality NHS health and social care data that can be used for research and analysis. The Diagnostic Imaging Data Set within this service may be useful because it collects information about the diagnostic imaging tests that people have and can be linked to other datasets.
There may be local or regional data collections that capture the outcome measures specified in the research recommendation. The sub-national secure data environments could be a regional alternative for data collection.
The quality and coverage of real-world data collections are of key importance when they are used to generate evidence. Active monitoring and follow-up through a central coordinating point is an effective and viable approach to ensuring good-quality data with high coverage. NICE's real-world evidence framework also provides detailed guidance on assessing the suitability of a real-world data source to answer a specific research question.
4.3 Data to be collected
The following outcomes have been identified for collection through the suggested studies:
Quantitative
- time from chest X-ray to report
- time from chest X-ray to CT scan report
- time from chest X-ray to diagnosis
- number of chest X-rays reviewed per reviewer and centre per day
- of those who had a chest X-ray, the number and proportion of people referred to have a chest CT scan
- grade of NHS staff reviewing and reporting chest X-ray
- agreement between AI-derived software and clinician review for normal and abnormal interpretation of chest X-ray
- number and proportion of chest X-rays defined as abnormal that are confirmed as abnormal by CT
- number of cancers detected
- stage of cancer at detection
- number of cancers missed, that is, those initially not picked up as abnormal, later referred to chest CT in the study period, and any subsequent cancer diagnosis
- technical failure and rejection rates
- all training and software implementation costs
- characteristics of patients, including age, sex, weight and height or body mass index (BMI), comorbidities such as asthma, scoliosis, interstitial lung disease and chronic obstructive pulmonary disease (COPD), family history of lung cancer, and young people who do not smoke.
Qualitative
- perceived accuracy of the technology in identifying abnormalities
- perceived appropriateness of image triage
- perceived impact on speed of review and reporting
- perceived performance of the software for people with underlying conditions and in high-risk groups
- clinician perspective on the use of AI-derived software.
Other information
The company should describe the process for monitoring the performance of the technologies while they are used in clinical practice. See NICE's evidence standards framework for digital health technologies for guidance on post-deployment reporting of changes in performance. This should include:
- future plans for updating the technology, including how regularly the algorithms are expected to retrain, re-version or change functionality
- the sources of retraining data, and how the quality of this data will be assessed
- processes in place for measuring performance over time, to detect any effects of planned changes or environmental factors that may affect performance
- processes in place to detect decreasing performance in certain groups of people over time
- whether there is an independent overview process for reviewing changes in performance
- an agreement on how and when changes in performance should be reported and to whom (evaluators, patients, carers and healthcare professionals).
The company should describe any actions taken in the design of the technology to mitigate algorithmic bias that could lead to unequal impacts between different groups of people.