Clinical and technical evidence

A literature search was carried out for this briefing in accordance with the interim process and methods statement. This briefing includes the most relevant or best available published evidence relating to the clinical effectiveness of the technology. Further information about how the evidence for this briefing was selected is available on request by contacting mibs@nice.org.uk.

Published evidence

Eleven studies are summarised in this briefing.

The briefing includes 7 validation studies, 3 observational studies and a before-and-after historic control study, which together include a total of 90,773 CT brain scans.

The clinical evidence and its strengths and limitations are summarised in the overall assessment of the evidence.

Overall assessment of the evidence

Six of the studies are reported as abstracts and lack methodological detail. The other 5 studies are peer-reviewed publications. The 7 validation studies report outcome measures relevant for establishing the accuracy, sensitivity and specificity of the software. The remaining 4 studies explore the usefulness of the technology in the clinical setting, including outcome measures related to the potential clinical and system-level benefits of the technology. It is not always clear whether the technology described in the studies has been updated since publication, and many of the named authors work for the companies, probably because of their involvement in developing the technologies.

The evidence base would benefit from randomised controlled trials assessing the effect of the technology on patient outcomes. These should include a follow-up period to capture any adverse events related to misdiagnosis, as well as time to treatment and time saved per scan.

Aidoc: head

Four studies are presented in this briefing, from 3 abstracts and 1 validation study, including a total of 64,990 non-contrast CT (NCCT) head scans.

Davis et al. (2019)

Study size, design and location

Before-and-after study of 51,793 head scans in the US, investigating the effect of using Aidoc: head to assist decision making for detecting intracranial haemorrhage (ICH) on patient length of stay and turnaround time for emergency department and inpatient head scans.

Intervention and comparator(s)

Aidoc: head compared with standard reporting.

Key outcomes

Compared with standard reporting, the use of Aidoc: head significantly reduced the turnaround time from 53 minutes to 46 minutes for head CT cases that were positive for ICH (p<0.001). Inpatient length of stay for positive cases decreased from 9,950 minutes to 8,870 minutes, but this was not statistically significant (p>0.05). Emergency department length of stay reduced significantly from 567 minutes to 508 minutes (p<0.001).

Strengths and limitations

This large before-and-after multicentre study reports relevant system-level outcomes. The study is reported as an abstract and has limited methodological information, which limits the value of the findings. The abstract does not report patient demographic data, the selection method for the historic control, or the protocol used for reporting ICH. Results may not be generalisable to the NHS because the study was done outside the UK.

Desbuquoit et al. (2019)

Study size, design and location

Retrospective analysis of 500 NCCT head scans in Belgium, validating the detection of ICH using Aidoc: head software compared with expert neuroradiologist review.

Intervention and comparator(s)

Aidoc: head and expert neuroradiologist review.

Key outcomes

Overall the software had a sensitivity of 95% and specificity of 94.3% for identifying pathological hyperdensities. The false positives were mainly because of beam hardening artefacts, hyperdense dural sinuses, or falcine or basal ganglia calcifications. False negatives were because of small haemorrhages.
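
For reference (the abstract does not report the underlying counts), the sensitivity and specificity quoted here, and the accuracy quoted for other studies in this briefing, are standard confusion-matrix statistics built from true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]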

Strengths and limitations

The study is presented as an abstract with limited methodological detail. The retrospective nature of the analysis increases the risk of selection bias, and confidence intervals are not reported.

Ojeda et al. (2019)

Study size, design and location

Retrospective analysis of 7,112 NCCT head scans in the US, validating the detection of ICH using Aidoc: head software compared with expert neuroradiologist reports and picture archiving and communication systems (PACS) queries.

Intervention and comparator(s)

Aidoc: head and expert neuroradiologist reports and PACS queries.

Key outcomes

Overall accuracy of the software to detect ICH was 98%, sensitivity was 95% and specificity was 98%.

Strengths and limitations

The data used for validating the software were not included in the development of the technology. The research team were blinded to the ground truth labels. The retrospective nature of the analysis increases the risk of selection bias. Confidence intervals are not reported and there is limited detail about the methodology presented in the abstract.

Rao et al. (2019)

Study size, design and location

Retrospective multicentre analysis of 5,585 NCCT head scans previously reported as negative for ICH.

Intervention and comparator(s)

Aidoc: head compared with original report.

Key outcomes

Of the 5,585 NCCT head scans reported to be negative for ICH by a radiologist, Aidoc: head identified 28 cases that were positive for ICH. After review by 3 neuroradiologists, 16 of the 28 cases were confirmed to have an ICH that had not been found on the original report.

Strengths and limitations

The large multicentre study addresses a relevant clinical and system-level outcome. The study is reported as an abstract and is limited in methodological detail. The retrospective nature of the study increases the risk of selection bias. The abstract does not state the level of experience of the radiologists responsible for the original reports, and results of statistical analyses are not reported. Results may not be generalisable to the NHS because the study was done outside the UK.

e-CTA

Four studies are presented in this briefing. Two are published and 2 are abstracts, with a total of 2,519 patients. Only the most relevant studies have been presented. The evidence base for e‑ASPECTS consists of a further 9 studies: 4 validation studies (Herweh et al. 2016; Nagel et al. 2017; Goebel et al. 2018; Sundaram et al. 2019) and 5 observational studies investigating the relationship between e‑ASPECTS score, clinical outcome, imaging measures and clinician decision making (Goberina et al. 2018; Nagel et al. 2019; Pfaff et al. 2017; Olive-Gadea et al. 2018; Grunwald et al. 2016).

Gunda et al. (2019)

Intervention and comparator(s)

e‑CTA and e‑ASPECTS compared with standard care.

Key outcomes

After implementation of e‑CTA and e‑ASPECTS, the proportion of patients having thrombolysis increased from 11.5% to 18.1% and the number of patients referred for thrombectomy increased from 11 to 19. Mean time to treatment decreased from 44 minutes to 41 minutes for thrombolysis and from 174 minutes to 145 minutes for mechanical thrombectomy.

Strengths and limitations

The study assesses the effect of the technology on clinically relevant and system-level outcome measures. Using real-world data supports the generalisability of the findings, but this should be interpreted with caution because of differences between healthcare systems. The study data were presented in an abstract with limited information, and the protocol used for imaging analysis during the 2017 standard care period is not outlined.

Nagel et al. (2018)

Intervention and comparator(s)

e‑ASPECTS and no comparator.

Key outcomes

Decreasing e‑ASPECTS scores were significantly correlated with increasing baseline NIHSS scores (r=-0.31; p<0.0001). Univariate analysis found that lower e‑ASPECTS scores were significantly associated with worse 90-day clinical outcomes: per 1-point increase in score, the odds decreased for death or disability (modified Rankin score 2 to 6; odds ratio [OR] 0.81; 95% confidence interval [CI] 0.77 to 0.86), death or disability (modified Rankin score 3 to 6; OR 0.89; 95% CI 0.83 to 0.95) and death (OR 0.86; 95% CI 0.79 to 0.95).
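
The briefing does not restate the underlying regression model; on the usual reading, a per-point OR of this kind comes from a logistic regression in which the score enters as a linear term:

\[
\log\frac{P(\text{poor outcome})}{1 - P(\text{poor outcome})} = \beta_0 + \beta_1 \cdot \text{score}, \qquad \text{OR per point} = e^{\beta_1}
\]

On this reading, an OR of 0.81 means the odds of death or disability fall by about 19% for each 1-point increase in e‑ASPECTS score.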

Strengths and limitations

The large multicentre study used relevant measures of clinical outcome to show the relevance of the e‑ASPECTS score. Selection criteria and methodology were clearly outlined, appropriate statistics were applied to investigate the relationship between e‑ASPECTS scores and clinical measures, and sensitivity analyses were reported to show the robustness of the findings. The retrospective nature of the study limits its value for interpreting the real-time use of the technology, and the study does not report the system-level benefits of the technology. The first author has received expenses and consultancy fees from the company. Results may not be generalisable to the NHS because the study was done outside the UK.

Grunwald et al. (2019)

Intervention and comparator(s)

e‑CTA (Brainomix) and 3 neuroradiologists.

Key outcomes

The automated e‑CTA score agreed with the consensus score in 90% of cases; the remaining 10% were 1 point off the consensus score (intraclass correlation coefficient 0.93, 0.90 to 0.95). Sensitivity and specificity for identifying favourable collateral flow were reported as 0.99 (0.93 to 1.00) and 0.94 (0.70 to 1.00), respectively. The automated e‑CTA score correlated positively with the Alberta Stroke Programme Early CT Score (Spearman correlation 0.46, p=0.0001).

Strengths and limitations

The study compared the automated e‑CTA score with the scores of 3 blinded experienced neuroradiologists and with a consensus score from the experienced neuroradiologists after unblinding. The study used appropriate bootstrapping for statistical analysis of imaging data. The combined scoring of the 3 neuroradiologists does not reflect real-world practice. Patient demographic data and clinical outcomes were not reported. Authors involved in the development of this publication work for the company.
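
The publication's bootstrap procedure is not reproduced here; the following is a minimal sketch of a percentile bootstrap confidence interval for an agreement statistic, assuming paired per-case automated and consensus scores (all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def exact_agreement(automated, consensus):
        # Proportion of cases where the automated score matches the consensus.
        return float(np.mean(automated == consensus))

    def percentile_bootstrap_ci(automated, consensus, n_boot=10_000, alpha=0.05):
        # Resample whole cases with replacement and recompute the statistic
        # to approximate its sampling distribution.
        n = len(automated)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            stats.append(exact_agreement(automated[idx], consensus[idx]))
        lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
        return float(lower), float(upper)

Here automated and consensus would each be a NumPy array with one score per scan; the same resampling works for other statistics, such as the intraclass correlation coefficient.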

Seker et al. (2019)

Intervention and comparator(s)

e‑CTA compared with 2 blinded expert neuroradiologists and with a non-blinded experienced interventional neuroradiologist with unrestricted clinical and imaging data access.

Key outcomes

Compared with the expert neuroradiologist analysis, the accuracy of e‑CTA to detect any occlusion was 0.88 (0.81 to 0.92), with a sensitivity of 0.79 (0.68 to 0.87) and specificity of 0.97 (0.91 to 1.00). Accuracy to detect proximal occlusions was 0.90 (0.84 to 0.94), with a sensitivity of 0.91 (0.79 to 0.98) and specificity of 0.90 (0.83 to 0.95). The technology's scores were similar to those of the blinded neuroradiology resident, and the blinded neuroradiologists' scores matched the experienced neuroradiologist's analysis.

Strengths and limitations

The study is presented as an abstract with limited information. It compared the technology with blinded specialists as well as a non-blinded specialist. Accuracy, sensitivity and specificity were reported for both the blinded specialists and the technology, but no statistical analyses were done to compare the differences between the blinded specialists, the technology and the control. The time taken for the algorithm to run and for the specialists to score was not reported. Authors involved in the development of this publication work for the company.

Icobrain

One study is presented in this briefing, including 252 patients.

Jain et al. (2019)

Intervention and comparator(s)

Icobrain compared with expert segmentation.

Key outcomes

The median volume difference between expert assessment and icobrain was 0.07 ml for acute intracranial lesions (n=144) and -0.01 ml for cistern segmentation (n=38). Correlation between expert assessments and icobrain was 0.91 for the volume of acute intracranial lesions and 0.94 for the volume of cisterns. Median precision and sensitivity were both 0.75 for acute intracranial lesions, and 0.72 and 0.69, respectively, for cistern segmentation. For midline shift, the median difference was -0.22 mm, with a correlation of 0.93 with the expert measurement.
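
The study's metric definitions are not restated in this briefing; on the usual voxelwise reading, with true positive (TP), false positive (FP) and false negative (FN) voxel counts taken against the expert segmentation:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{sensitivity} = \frac{TP}{TP + FN}
\]

Precision penalises voxels segmented outside the expert's annotation; sensitivity penalises expert-labelled voxels that the software misses.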

Strengths and limitations

The study outlines a detailed methodology describing training and validation, and the data come from multiple centres with varied protocols used to address different injuries. The methodology states that data for 5,000 patients were available but only 252 patients were included, and the study does not state the selection criteria for the sample used. The training method describes a cascade approach, which differs from other artificial intelligence systems but is considered appropriate for segmentation. The lead author works for the company.
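
The study's cascade is not published as code; purely as an illustrative sketch, with simple intensity thresholds standing in for the trained model at each stage, a two-stage cascade restricts the finer, more selective stage to candidates flagged by the coarse stage:

    import numpy as np

    def coarse_stage(volume, threshold=0.5):
        # Stage 1: a cheap voxelwise screen that flags candidate lesion voxels.
        return volume > threshold

    def fine_stage(volume, candidates, refine_threshold=0.8):
        # Stage 2: a more selective rule applied only within stage-1 candidates.
        refined = np.zeros(volume.shape, dtype=bool)
        refined[candidates] = volume[candidates] > refine_threshold
        return refined

    def cascade_segment(volume):
        # Each stage narrows the search space for the next, which suits
        # segmentation tasks where positive voxels are rare.
        return fine_stage(volume, coarse_stage(volume))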

Zebra

One study is presented in this briefing, from a conference abstract reporting a retrospective analysis of 1,426 CT scans.

Bar et al. (2018)

Intervention and comparator(s)

Zebra triage compared with expert-validated annotation.

Key outcomes

For classifying ICH, Zebra triage had an area under the curve of 0.9481 in an enriched dataset (64% ICH-positive scans) and 0.9487 in a randomly distributed dataset (16% ICH-positive scans). Manual review of false positives showed misclassification was most likely in cases of calcification.
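
As a standard point of interpretation (not specific to this abstract), the area under the receiver operating characteristic curve is the probability that a randomly chosen ICH-positive scan receives a higher score than a randomly chosen ICH-negative scan:

\[
\text{AUC} = P\big(s(x^{+}) > s(x^{-})\big)
\]

where s is the model's output score. Because this is a ranking measure, it is largely insensitive to the proportion of positive scans, which is consistent with the near-identical values reported for the enriched and randomly distributed datasets.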

Strengths and limitations

Information is limited because the publication is an abstract. The abstract describes the training of the technology and reports the area under the curve for detecting ICH across 2 datasets. It is unclear from the abstract whether the cases used for training were included in the test datasets. The abstract suggests further learning would improve performance; this indicates the technology used in the study may be different from the current version. Authors involved in the development of this publication work for the company.

qER

One study is presented in this briefing, including a retrospective analysis of 21,586 CT scans.

Chilamkurthy et al. (2018)

Intervention and comparator(s)

qER compared with the gold standard from the clinical report and the consensus of 3 independent radiologists.

Key outcomes

The technology was validated against 2 datasets: Qure25k, with 21,095 CT scans, and CQ500, with 491 CT scans. At a high sensitivity operating point, sensitivities of the algorithm for ICH, calvarial fracture and midline shift in the Qure25k dataset were 0.90 (95% CI 0.89 to 0.91), 0.90 (95% CI 0.88 to 0.91) and 0.91 (95% CI 0.89 to 0.93), respectively, and specificities were 0.73 (95% CI 0.72 to 0.73), 0.77 (95% CI 0.77 to 0.78) and 0.84 (95% CI 0.83 to 0.84), respectively. For the CQ500 dataset, the sensitivities of the algorithm for ICH, calvarial fracture and midline shift at a high sensitivity operating point were 0.94 (95% CI 0.90 to 0.97), 0.95 (95% CI 0.83 to 0.99) and 0.94 (95% CI 0.85 to 0.98), respectively, and specificities were 0.71 (95% CI 0.65 to 0.76), 0.86 (95% CI 0.82 to 0.89) and 0.89 (95% CI 0.86 to 0.92), respectively.
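
The paper's exact thresholding procedure is not reproduced here; as a minimal sketch (function and variable names are illustrative), a high sensitivity operating point is a score threshold chosen so that sensitivity on a validation set reaches a target, with specificity then following from that choice:

    import numpy as np

    def high_sensitivity_operating_point(scores, labels, target_sensitivity=0.90):
        # Scan candidate thresholds from high to low; lowering the threshold
        # flags more scans as positive, raising sensitivity and lowering specificity.
        for threshold in np.unique(scores)[::-1]:
            predicted_positive = scores >= threshold
            sensitivity = predicted_positive[labels == 1].mean()
            if sensitivity >= target_sensitivity:
                specificity = (~predicted_positive)[labels == 0].mean()
                return threshold, sensitivity, specificity
        return None  # only reached if there are no positive scans

Here scores would be the algorithm's per-scan outputs and labels the ground-truth indicators (1 or 0); trading specificity for sensitivity in this way matches the pattern of high sensitivities and lower specificities reported above.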

Strengths and limitations

This is a large and well-designed validation study. The training and methodology are well detailed. Scans used to train the software were not included in the datasets used for validating the software. Scans included in the Qure25k dataset were randomly allocated. The CQ500 dataset was not randomly allocated and could be subject to selection bias. Algorithm run time was not reported. Authors involved in the development of this publication work for the company.

Sustainability

The companies did not make any relevant claims about the sustainability aspects of these technologies.

Recent and ongoing studies