Clinical and technical evidence

A literature search was carried out for this briefing in accordance with the interim process and methods statement. This briefing includes the most relevant or best available published evidence relating to the clinical effectiveness of the technology. Further information about how the evidence for this briefing was selected is available on request by contacting mibs@nice.org.uk.

Published evidence

This report summarises 7 observational studies: 3 were prospective and 4 retrospective. Of these, 4 were done in the UK, 1 in the US, 1 in Portugal and 1 in Italy. The studies included retinal images from 199,135 people with diabetes.

The clinical evidence and its strengths and limitations are summarised in the overall assessment of the evidence.

Overall assessment of the evidence

The quality of the evidence supporting the technologies varies. None of the studies was a randomised controlled trial. Most of the evidence (5 of the studies) was on EyeArt, which had evidence from more countries than the other technologies and the study with the largest sample size (n=101,710). There was UK evidence for all 3 technologies.

Five of the 7 studies were independent studies that did not involve the companies or their employees. Three of these were from the UK and received funding from the National Institute for Health Research (NIHR). One NHS-based study, funded by the NIHR, compared 3 diabetic retinopathy AI technologies: EyeArt, iGradingM and Retmarker. The other 2 assessed EyeArt against human grading using images from the English NHS diabetic eye screening programme (NDESP).

The Ribeiro et al. (2015) study assessed Retmarker within an established screening programme in Portugal, so it has a large sample size drawn from the same continuously updated dataset.

Tufail et al. (2016)

Interventions and comparator

EyeArt, iGradingM and Retmarker compared with manual image grading.

Key outcomes

Sensitivity for EyeArt was 94.7% (95% confidence interval [CI] 94.2 to 95.2) for any retinopathy, 93.8% (95% CI 92.9 to 94.6) for referable retinopathy and 99.6% (95% CI 97.0 to 99.9) for proliferative retinopathy.

Sensitivity for Retmarker was 73.0% (95% CI 72.0 to 74.0) for any retinopathy, 85.0% (95% CI 83.6 to 86.2) for referable retinopathy and 97.9% (95% CI 94.9 to 99.1) for proliferative retinopathy.

iGradingM classified all images as either 'disease' or 'ungradable', limiting the possible analysis.

The sensitivity and false positive rates for EyeArt were not affected by ethnicity, sex or camera type, but sensitivity declined marginally with increasing patient age.
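The sensitivity and false positive figures above follow the standard definitions for a screening test (stated here for reference; they are not restated in the study):

\[ \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{false positive rate} = \frac{FP}{FP + TN} = 1 - \text{specificity} \]

where TP and FN are referable cases that the software flags and misses respectively, and FP and TN are non-referable cases that the software flags and correctly passes respectively.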

The cost analysis estimated that the technologies would become more expensive than human grading above a cost of £3.82 per patient for Retmarker and £2.71 per patient for EyeArt.

Strengths and limitations

Strengths: the study had a large sample size, and all 3 technologies evaluated the same images. This independent study was done in the UK in the NDESP. The cost calculations are relevant to the NHS.

Limitations: the cost analysis was limited because the study was not randomised.

Heydon et al. (2021)

Intervention and comparator

EyeArt v2.1.0 compared with human grading.

Key outcomes

Sensitivity of EyeArt was 95.7% (95% CI 94.8 to 96.5) for referable retinopathy (human graded as ungradable, referable maculopathy, moderate to severe non-proliferative or proliferative). This comprises sensitivities of 98.3% (95% CI 97.3 to 98.9) for mild to moderate non-proliferative retinopathy with referable maculopathy, 100% (95% CI 98.7 to 100) for moderate to severe non-proliferative retinopathy and 100% (95% CI 97.9 to 100) for proliferative disease. EyeArt agreed with the human grade of no retinopathy (specificity) in 68% of cases (95% CI 67 to 69), falling to 54.0% (95% CI 53.4 to 54.5) when no retinopathy was combined with non-referable retinopathy.

Strengths and limitations

Strengths: this was a prospective independent study with NIHR funding and a large sample size from 3 real-world screening programmes. The authors reported the processes and protocols in detail. No limitations were identified.

Olvera-Barrios et al. (2021)

Intervention and comparator

EyeArt v2.1.0 using true-colour, wide-field confocal scanning images and standard fundus images in the English NDESP. Imaging with mydriasis (2-field protocol) was done with the EIDON platform (CenterVue, Padua, Italy) and with standard NDESP cameras, and the comparator was human grading of the standard NDESP images.

Key outcomes

Sensitivity estimates for retinopathy grades were: (EIDON images) 92.27% (95% CI 88.43 to 94.69) for any retinopathy, 99% (95% CI 95.35 to 100) for vision-threatening retinopathy and 100% (95% CI 61 to 100) for proliferative retinopathy; (NDESP images) 92.26% (95% CI 88.37 to 94.69) for any retinopathy, 100% (95% CI 99.53 to 100) for vision-threatening retinopathy and 100% (95% CI 61 to 100) for proliferative retinopathy.
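The wide intervals for proliferative retinopathy reflect how few proliferative cases were in the sample. As an illustrative calculation only (the study's case count and interval method are assumed, not reported here), an exact Clopper-Pearson lower bound for an observed sensitivity of 100% with n cases is

\[ \left(\tfrac{\alpha}{2}\right)^{1/n} = 0.025^{1/n}, \]

which is about 0.61 for n between 7 and 8, so a 95% CI of 61 to 100 is consistent with only around 7 or 8 proliferative cases.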

One case of vision-threatening retinopathy (R1M1) was missed by the EyeArt when analysing the EIDON images but identified by the human graders. The EyeArt identified all cases of vision-threatening retinopathy in the NDESP images.

Strengths and limitations

Strengths: this was an independent study with no support from or affiliation to the company. The quality of reporting was high, and the context relevant, being from a screening programme in the UK. The risk of bias was low because it included a representative sample, published a prospective protocol, used masked graders, and sent anonymised data to EyeArt. The vendor was not allowed access to the software or the dataset during the study period.

Limitations: the analysis was based on retrospectively collected data. A service evaluation of the EIDON confocal scanner with human grading showed that EIDON images could visualise high-risk retinopathy features missed in the NDESP images, so the choice of reference standard is debatable.

Demographics, duration of diabetes, ethnicity, time taken for imaging with each imaging platform and pupillary diameter for this dataset were not analysed. There might be a 'black-box' issue (meaning there is a lack of transparency about how the output was calculated) with EyeArt's processing of EIDON images, because the reference parameters or data points used by the software may differ from those used for standard 45-degree colour fundus images, which could lead to differences in grading. Further work is needed to determine whether the wide-field true-colour images offer advantages in diagnostic accuracy with the EyeArt software.

Ribeiro et al. (2015)

Intervention and comparator

RetmarkerSR first, followed by human grading if disease was detected; no comparator.

Key outcomes

The screening programme images were analysed in a central reading centre using first an automated disease or no disease analysis and then human grading of the disease cases. Results were: 71.5% no retinopathy, 22.7% non‑proliferative retinopathy, 2.2% maculopathy, 0.1% proliferative retinopathy and 3.5% not classifiable. The authors concluded that using an automated system could reduce the need for human grading by 48.42%.

Some eyes (3,132 [3.5%]) could not be classified because of poor image quality caused by cataract or miosis, or because the patient was unable to cooperate. This proportion was similar to that in the previous screening programme, before the AI was implemented.

The grader identified only 11 of the 3,287 quality control cases (0.3%; 0.02% of total patients) as having referable retinopathy (false negatives). None of these was proliferative retinopathy.

The intra-grader analysis showed an agreement of 98.92% and a sensitivity and specificity of 100% and 99.51%, respectively. The inter-grader analysis showed an overall agreement of 96.65%, with a sensitivity and specificity of 97.52% and 98.55%, respectively.

Strengths and limitations

Strengths: the study reports on a large dataset from a regional screening programme that shows the prevalence of diabetic retinopathy grades. The methods and the results were well described. To check the safety of non-disease cases, a random sample was sent to a masked human grader for double-checking. For quality control, a random sample of all images was graded by a second masked human grader.

Limitations: there was no control group. The annual follow-up re-screening results for non-disease cases were not reported to assess the safety and risk of false negative cases. Two of the authors are affiliated with the company.

Sarao et al. (2020)

Intervention and comparator

EyeArt v2.1 analysis of conventional flash fundus camera images compared with EyeArt v2.1 analysis of LED confocal scanner images.

Key outcomes

Sensitivity, specificity, and area under the curve (AUC) for flash fundus camera were 90.8% (95% CI 85.0 to 94.9), 75.3% (95% CI 68.0 to 81.7) and 0.830 (95% CI 0.78 to 0.87) respectively; and for LED confocal scanner were 94.1% (95% CI 89.1 to 97.3), 86.8% (95% CI 80.7 to 91.6), and 0.905 (95% CI 0.87 to 0.93) respectively. The difference between AUCs was 0.0737 (95% CI 0.0263 to 0.121; p=0.0023).

The receiver operating characteristic curves for referable diabetic retinopathy show that the AUC was higher with the confocal scanner (z statistic 3.047, p=0.002), implying better grading accuracy.
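As a check (using a standard normal approximation, which is an assumption rather than the study's stated method), a z statistic of 3.047 corresponds to a 2-sided p value of

\[ p = 2\bigl(1 - \Phi(3.047)\bigr) \approx 2 \times 0.0012 \approx 0.002, \]

which is consistent with the reported p values of 0.002 and 0.0023.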

Images from 8 eyes (2.4%) were classified as ungradable by the human graders so were not considered for comparison with the automated assessment. The reasons were inadequate field capture (5 eyes) and significant media opacities (3 eyes).

Strengths and limitations

Strengths: the study reported the imaging protocol, and the methods and limitations sections were detailed and of high quality, reducing the risk of bias. Retinal images were submitted to the Eyenuk cloud after the patient's identity had been masked. The authors were not affiliated with the company, so this is an independent study.

Limitations: the sample size is relatively small, and the proportion of patients with pre-proliferative retinopathy is higher than is typically reported in a screening programme. The EyeArt software was largely trained on images from conventional flash fundus cameras, which might influence the results obtained with the scanner. Most evaluations of automated diagnosis of eye disease focus on a binary classification, whereas in a clinical setting patients often have several coexisting retinal conditions, and the algorithm's accuracy falls as the number of retinal diseases increases.

Bhaskaranand et al. (2019)

Intervention and comparator

EyeArt; no comparator.

Key outcomes

Automated analysis of the entire dataset (850,908 images) was completed in less than 48 hours; 4.9% of visits were excluded from the analysis because of insufficient information. Of those included, 0.9% were flagged as being unscreenable by the technology and were referred to a specialist. The screening sensitivity of the technology was 91.3% (95% CI 90.9 to 91.7) and specificity was 91.1% (95% CI 90.9 to 91.3). The positive predictive value was 72.5% (95% CI 71.9 to 73.0), and the negative predictive value was 97.6% (95% CI 97.5 to 97.7). The AUC was 0.965 (95% CI 0.963 to 0.966).
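For context, the positive and negative predictive values depend on sensitivity, specificity and the prevalence (p) of referable disease in the screened population. The formulas and back-calculation below are illustrative and are not presented in the study:

\[ \text{PPV} = \frac{\text{sens}\cdot p}{\text{sens}\cdot p + (1-\text{spec})(1-p)}, \qquad \text{NPV} = \frac{\text{spec}\,(1-p)}{\text{spec}\,(1-p) + (1-\text{sens})\,p} \]

With sensitivity 91.3% and specificity 91.1%, the reported PPV of 72.5% and NPV of 97.6% are consistent with a referable-disease prevalence of roughly 20% in this population.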

Strengths and limitations

Strengths: the study included a large consecutive patient population, and the technology assessed retinopathy in 99% of patient visits. Although the study did not include a comparator, specialist assessment of fundus images was used as the reference standard. Recruiting consecutive patients from over 400 primary care centres reduced the risk of selection bias.

Limitations: the lead author is an employee of Eyenuk. Patient demographic data, such as age and time since diabetes diagnosis, were not collected.

Bouhaimed et al. (2008)

Intervention and comparator

RetinaLyze, with expert grading of images as the reference; no comparator.

Key outcomes

The technology detected red lesions with a sensitivity of 82%, a specificity of 75%, a positive predictive value of 41%, and a negative predictive value of 95%. The technology detected red and bright lesions with a sensitivity of 88%, a specificity of 52%, a positive predictive value of 28%, and a negative predictive value of 95%.

Strengths and limitations

Strengths: this was an independent study done in the UK.

Limitations: study data were collected from 2002 to 2004 and clinical practice is likely to have changed.

Sustainability

The companies claim that using AI to screen digital fundus images could reduce environmental impact because less time and fewer staff are needed to process the images. There is no published evidence to support this.

Recent and ongoing studies

No ongoing or in-development studies were identified for the other technologies.