Clinical and technical evidence

This briefing includes the most relevant or best available published evidence relating to the diagnostic accuracy of the technology. Further information about how the evidence for this briefing was selected is available on request by contacting mibs@nice.org.uk.

Published evidence

This briefing summarises 1 poster abstract for nomela, 2 published full-text studies on DERM, 4 published full-text studies on SkinVision and 7 published full-text studies on Moleanalyzer pro. There are further studies published for DERM and Moleanalyzer pro but only the most relevant studies are reported here.

The 14 studies summarised in this briefing included over 2,598 people in total. Thirteen of the 14 studies assessed the diagnostic test accuracy of the different technologies using sensitivity, specificity and receiver operating characteristic (ROC) area under the curve (AUC), compared with a dermatologist's diagnostic accuracy.

The clinical evidence and its strengths and limitations are summarised in the overall assessment of the evidence.

Overall assessment of the evidence

The 1 study for nomela is a prospective study with a large sample size of 1,200 people done in a secondary care setting in 2 phases. The 2 studies for DERM include an algorithm development study compared with current diagnostic practices based on a meta-analysis of 82 published studies and a prospective, multicentre, single-arm clinical validation study of 514 people. The studies for SkinVision are a mix of prospective and retrospective single and multicentre accuracy studies with a total of 628 people included. Five of the 7 studies for Moleanalyzer pro are cross-sectional diagnostic test accuracy studies with a total of 570 images analysed, 1 is a non-randomised comparative study with 72 people, and 1 is a prospective diagnostic accuracy study with 184 people.

The primary outcome in most of the studies was diagnostic accuracy of the technologies against dermatologist clinical examination as a reference standard (measured using ROC AUC, sensitivity and specificity). Thirteen of the 14 studies included in this briefing are full-text papers with different population sizes, of which the smallest is 72 people and the largest is 514 people. The poster abstract for nomela referred to 1,200 people.
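These accuracy measures all derive from the standard 2×2 confusion matrix of test results against the reference standard. A minimal sketch of the definitions, using hypothetical counts rather than data from any of the included studies:

```python
# Diagnostic accuracy metrics from a 2x2 confusion matrix.
# Counts below are hypothetical, for illustration only.
tp, fn = 90, 10   # melanomas correctly / incorrectly classified
tn, fp = 160, 40  # benign lesions correctly / incorrectly classified

sensitivity = tp / (tp + fn)  # true-positive rate
specificity = tn / (tn + fp)  # true-negative rate
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
print(f"PPV {ppv:.1%}, NPV {npv:.1%}")
```

Unlike sensitivity and specificity, PPV and NPV depend on the prevalence of melanoma in the study population, which is one reason results from secondary care populations may not transfer directly to primary care.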

All people included in these studies were aged 18 and above. Only 2 studies were done in the UK (McKenna et al. 2020; Phillips et al. 2019), 7 studies were in Germany (Fink et al. 2019; Haenssle et al. 2020; Maier et al. 2015; Winkler et al. 2019, 2020, 2021a, 2021b), 1 study was in Canada (MacLellan et al. 2021), 1 was international (Phillips et al. 2020) and 3 studies were in the Netherlands (Sangers et al. 2022; Thissen et al. 2017; Udrea et al. 2020).

Studies done outside the UK may have limited generalisability to the NHS. None of the included studies reported patient-relevant outcomes, and none were done across all the intended clinical settings (pre-primary care, primary care and secondary care) or with the full population of interest. However, Skin Analytics states that there has been a recent study (DERM Health Economics Study 2020/21) on lesions referred from primary care, which included a patient feedback survey and is expected to be published soon. Most of the studies were done in secondary care settings, limiting their generalisability to primary care. Further population-based prospective studies in the NHS, done in the intended clinical settings and focusing on the impact of accuracy on clinical management and outcomes (both patient and system), would provide further evidence to support clinical adoption and use of these technologies in the NHS.

McKenna et al. (2020)

Study size, design and location

A 2-phase performance evaluation study of 1,200 adults in a secondary care UK setting. Phase 1: an open, prospective, non‑randomised evaluation of primary care referrals (1,200 people). Phase 2: a retrospective evaluation using nomela of historical images of malignant melanomas, as diagnosed by a pathology department.

Intervention and comparator(s)

nomela software for the detection of melanoma compared with the clinical (and, when possible, histological) diagnosis in primary care.

Key outcomes

With its sensitivity set at 100%, nomela achieved a specificity of 53%.

Strengths and limitations

Strengths: the study used a large sample size, which increases its power and precision. The study was done in a UK setting.

Limitations: the single-centre, secondary care study design with a wide range of exclusion criteria, although in the UK, may make these results less generalisable to other settings. The poster abstract provides only minimal information to critique, but does state that the study design had prospective and retrospective components, which could have led to bias. The extensive exclusion criteria limit the technology's utility as a referral tool in a primary care setting. No people with African or African-Caribbean skin types were included, although the study does state that the mix of Fitzpatrick skin types included type 4 (35.1%), type 5 (8.9%) and type 6 (0.1%).

Phillips et al. (2019)

Interventions and comparator

DERM AI technology compared with dermatologist clinical assessment of likelihood of melanoma with a dermatoscope.

Key outcomes

The algorithm achieved a ROC AUC of 90.1% (95% confidence interval [CI] 86.3% to 94.0%) for biopsied lesions and 95.8% (95% CI 94.1% to 97.6%) for all lesions. When set at 100% sensitivity, the algorithm achieved a specificity of 64.8% with iPhone 6s, while dermatologists achieved ROC AUC of 77.8% (95% CI 72.5% to 81.9%) and a specificity of 69.9%. However, at 95% sensitivity, the specificity of biopsied and all lesions for each camera were as follows: iPhone 6s, 50.6% and 78.1% respectively; Galaxy S6, 46.9% and 75.6% respectively; and for digital single-lens reflex (DSLR) cameras, 27.6% and 45.5% respectively. This compares with a specificity of 69.9% on all lesions by clinicians.

The number of biopsies needed to identify 1 case of melanoma at 95% sensitivity was 3.04 for biopsied lesions and 4.00 for images taken with iPhone 6s; 3.22 and 4.39, respectively, for images taken with Galaxy S6; and 4.32 and 9.02, respectively, for images taken with DSLR. This compares with the number needed by clinicians of 4.92. At 100% negative predictive value, the positive predictive value was 20.3% for clinicians' assessment, 17.9% for images taken with iPhone 6s, 13.4% for Galaxy S6 and 9.5% for DSLR.
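The 'number of biopsies needed to identify 1 case of melanoma' is the reciprocal of the positive predictive value at the chosen operating point. A minimal check against the clinicians' figures above (PPV 20.3%, number needed 4.92; the small difference is rounding of the published PPV):

```python
# Number needed to biopsy (NNB) is the reciprocal of the positive
# predictive value (PPV) at a given operating point.
def number_needed_to_biopsy(ppv: float) -> float:
    return 1.0 / ppv

# Clinicians' PPV of 20.3%, as reported above.
print(round(number_needed_to_biopsy(0.203), 2))  # -> 4.93
```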

Strengths and limitations

Strengths: the study included a large sample size of 514 people and was done in 7 UK hospitals. The gender distribution was almost balanced and the mean age supports the generalisability of the findings. The study used 3 different cameras and included both biopsied lesions and lesions that were clearly benign.

Limitations: 96.8% of the people in the study were white, which may bias accuracy estimates and limits applicability to melanoma detection in black and brown skin. A total of 849 images were excluded because of poor quality. The secondary care setting limits the generalisability of these results to primary care.

Phillips et al. (2020)

Intervention and comparators

DERM technology accuracy compared with clinical assessment performance, assessed by meta-analysis of studies examining the accuracy of naked-eye examination, with or without dermoscopy, by specialist and general physicians whose clinical diagnosis was compared with histopathology.

Key outcomes

DERM achieved a ROC AUC of 93% (95% CI 92% to 94%). The statistically determined optimum sensitivity and specificity were 85.0% and 85.3%, respectively, although these are not the DERM settings proposed for use in clinical practice. When DERM was set to achieve 95% sensitivity, it achieved a specificity of 64.1%. When DERM was set to achieve 95% specificity, it achieved a sensitivity of 66.9%. In comparison, a meta-analysis of more than 10 studies showed that primary care physicians achieve an AUC of 83% (95% CI 79% to 86%), with sensitivity and specificity of 79.9% and 70.9%, respectively. Dermatologists (92 studies) achieved an AUC of 91% (95% CI 88% to 93%) and sensitivity and specificity of 87.5% and 81.4%, respectively.

Strengths and limitations

Strengths: the algorithm was not developed using a single dataset of images. The study reports a range of data from the development of an early version of DERM, providing transparency about the underlying drivers of the algorithm's performance. The use of a meta-analysis to derive the performance of standard care was novel at the time.

Limitations: the study does not define the ethnic diversity included and there is limited information about the demographic data of the images used. Because this was not a systematic literature review, there may be some bias in the selection of papers included for comparison. Relatedly, the number of studies pooled in each group for the primary and secondary care comparison is not equivalent, so the effect sizes may not be directly comparable.

Udrea et al. (2020)

Intervention and comparator

The SkinVision smartphone application's machine learning algorithm.

Key outcomes

Overall, the algorithm has a sensitivity of 95.1% (95% CI 91.9% to 97.3%) for detecting skin cancer. In particular, the sensitivity for detecting melanoma is 92.8% (95% CI 87.8% to 96.5%) and the sensitivity for detecting keratinocyte carcinoma and its precursors is 97.3% (95% CI 93.2% to 99.3%). The specificity of the algorithm is 78.3% (95% CI 77.2% to 79.3%).

Strengths and limitations

Strengths: the study used data from different sources, including 2 previous studies and a smartphone application user database. Combining data from different sources provides a larger dataset.

Limitations: the study is retrospective, drawing on data from 2 previously published studies and a smartphone application user database. Because the clinical studies mainly included high-risk lesions, the dataset may be inadequate for evaluating specificity. Complete follow-up was not possible for all smartphone application users, so positive and negative predictive values could not be calculated. Dermatologist risk ratings were based on images taken by users without further investigation, and cases were clinically validated as benign without a histopathology report. The main risk associated with use of smartphone applications by lay users is that a malignant melanoma or keratinocyte carcinoma is incorrectly classified as low risk (that is, a false negative) and its diagnosis and treatment is delayed.

Fink et al. (2019)

Intervention and comparator

Moleanalyzer pro compared with 11 dermatologists who were presented with dermoscopic images on screen.

Key outcomes

For the classification of the 72 dermoscopic images, dermatologists showed a sensitivity, specificity and diagnostic odds ratio (DOR) of 90.6%, 71.0% and 24, respectively. With the same set of images, Moleanalyzer pro showed a sensitivity of 97.1%, specificity of 78.8% and DOR of 34. However, there was no statistically significant difference between the dermatologists and the convolutional neural network.
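The diagnostic odds ratio can be recovered from sensitivity and specificity as the ratio of the odds of a positive test in disease to the odds of a positive test without disease. A minimal sketch, checked against the dermatologists' figures above:

```python
# Diagnostic odds ratio (DOR) from sensitivity and specificity:
# DOR = (sens / (1 - sens)) * (spec / (1 - spec)).
def diagnostic_odds_ratio(sens: float, spec: float) -> float:
    return (sens / (1 - sens)) * (spec / (1 - spec))

# Dermatologists' pooled figures from the study above
# (sensitivity 90.6%, specificity 71.0%).
print(round(diagnostic_odds_ratio(0.906, 0.710)))  # -> 24
```

Note that published DORs are usually computed from the raw 2×2 counts, so values derived from rounded sensitivity and specificity may differ slightly.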

Strengths and limitations

Strengths: the study comparator included 11 dermatologists ranging in experience and practice.

Limitations: the non-randomised design might introduce bias. There is no reference to a sample size calculation. The study included only 2 types of lesions, which is not representative of clinical practice. Also, dermatologists were not given the additional information for decision making that would be present in a real-life clinical setting, so the study may underestimate their diagnostic performance in real-life practice. The study does not define the ethnic diversity included or provide data on skin types.

Haenssle et al. (2020)

Intervention and comparator

Moleanalyzer pro analysing dermoscopic images was compared with dermatologist image assessment using a web-based platform at 2 levels: at level 1, dermoscopic images were assessed alone; at level 2, images included clinical close-ups, dermoscopy and textual information.

Key outcomes

Moleanalyzer pro achieved a sensitivity, specificity and ROC AUC of 95.0% (95% CI 83.5% to 98.6%), 76.7% (95% CI 64.6% to 85.6%) and 91.8% (95% CI 86.6% to 97.0%), respectively. At level 1, for management decisions based on 1 dermoscopic image per case, the dermatologists' sensitivity was significantly lower than Moleanalyzer pro's, at 89.0% (95% CI 87.4% to 90.6%) compared with 95.0% (95% CI 83.5% to 98.6%), p<0.001.

With level 2 information, the sensitivity for dermatologists significantly improved to 94.1% (95% CI 93.1% to 95.1%; p<0.001), while the specificity remained unchanged at 80.4% (95% CI 78.4% to 82.4%; p=0.97). When fixing the Moleanalyzer pro's specificity at 80.4% (the mean specificity of the dermatologists' management decision in level 2), the sensitivity was 95.0% (95% CI 83.5% to 98.6%), almost equivalent to the sensitivity of the dermatologists, which was 94.1% (95% CI 93.1% to 95.1%), p=0.1. The Moleanalyzer pro demonstrated accuracy of 84.0% (95% CI 75.6% to 89.9%) compared with mean dermatologists' accuracy of 85.9% (95% CI 84.7% to 87.1%), p=0.003.

Strengths and limitations

Strengths: the study included a comparison at level 2, where dermatologists had additional information on which to base their decisions, which is more reflective of clinical practice.

Limitations: the study was done in a single centre and is based on selected cases without clear reporting of inclusion and exclusion criteria. The dataset had a small proportion of melanoma cases per test set. This limited number of melanoma cases means it cannot be generalised to a primary care setting. The study does not report the sample's ethnic diversity or demographic data, including Fitzpatrick skin type.

MacLellan et al. (2021)

Intervention and comparator

Non-invasive imaging techniques (Moleanalyzer pro compared with MelaFind and Verisante Aura) and teledermatology (teledermoscopist) compared with dermatologist naked-eye examination (clinical examination) with a dermatoscope.

Key outcomes

Sensitivity and specificity were 88.1% (95% CI 79.4% to 96.9%) and 78.8% (95% CI 71.5% to 86.2%), respectively, for Moleanalyzer pro, and 96.6% (95% CI 91.9% to 101.3%) and 32.2% (95% CI 18.4% to 46.0%), respectively, for dermatologist examination.

Moleanalyzer pro achieved the highest specificity of the techniques compared. Sensitivity and specificity for the other comparators were reported and can be found in the study results.

Strengths and limitations

Strengths: different non-invasive imaging techniques were compared, allowing direct comparison of results. All lesions were excised regardless of the clinical diagnosis, enabling gold standard review by 2 dermatopathologists. Histopathology was the reference standard for sensitivity and specificity.

Limitations: the study was not done in the UK, which may limit its generalisability to the NHS setting. The study does not define the ethnic diversity included and excluded Fitzpatrick skin type 2 and above. The melanoma sample size is small (only 32) and two-thirds of the people with melanoma were male. The study lacked a statistical comparison between the intervention and comparator.

Winkler et al. (2019)

Intervention and comparator

Moleanalyzer pro accuracy in detecting melanoma in skin, both marked and not marked with gentian violet surgical marker.

Key outcomes

In unmarked skin lesions, Moleanalyzer pro achieved a sensitivity of 95.7% (95% CI 79% to 99.2%), specificity of 84.1% (95% CI 76.0% to 89.8%) and ROC AUC of 96.9% (95% CI 93.5% to 100%). In marked skin lesions, Moleanalyzer pro achieved a sensitivity of 100% (95% CI 85.7% to 100%), specificity of 45.8% (95% CI 36.7% to 55.2%) and ROC AUC of 92.2% (95% CI 87.1% to 100%), p<0.001, demonstrating that skin markings increased the false-positive rate of Moleanalyzer pro.

Strengths and limitations

Strengths: the study used a large sample of 130 melanocytic lesions, which gives good power. The study also considered the practical impact of artefacts on the accuracy of Moleanalyzer pro, in this case image cropping and the purple marker used when lesions are listed for excision or reviewed by a clinician.

Limitations: the study included a highly selected population, as only benign naevi or melanoma were included in this cohort, and as such is not applicable to current clinical practice. As skin markings were electronically duplicated from digital images and superimposed on a melanoma background, this may introduce bias in results. Fitzpatrick skin type and ethnicity, age and gender of included cases were not reported.

Winkler et al. (2020)

Intervention and comparator

Moleanalyzer pro accuracy in different subtypes of melanoma (for example, superficial spreading melanoma [SSM], lentigo maligna melanoma [LMM], nodular melanoma [NM], mucosal melanoma [MM], acrolentiginous melanoma [AMskin] and acral melanoma [AMnail]) compared with recorded ground truth of all melanoma cases (n=180) and benign lesions (n=600) based on histopathological diagnosis.

Key outcomes

Moleanalyzer pro achieved high-level performance in the SSM, NM and LMM sets, with sensitivity of more than 93.3%, specificity of more than 65% and ROC AUC of more than 92.6%. In the AMskin set, sensitivity was lower at 83.0%, with a specificity of 91.0% and ROC AUC of 92.8%, while in the AMnail set sensitivity was 53.3%, with a specificity of 68.0% and ROC AUC of 62.1%.

Strengths and limitations

Strengths: the study recognises that there are different subtypes of melanoma, with different clinical presentations and outcomes, and demonstrates that accuracy is affected by subtype. The ground truth for melanoma was histological diagnosis. The study used a large sample of images, which increases precision.

Limitations: selecting dermoscopic images from local libraries of different institutions (Lyon and Munich) does not guarantee a representative sample of the population and further limits its generalisability to the NHS. A small cohort of melanoma was included in each dataset (n=25) and there was variable gender distribution. The study does not define the ethnic diversity included.

Winkler et al. (2021a)

Intervention and comparator

In this diagnostic test accuracy study, Moleanalyzer pro performance was compared on images with and without a digitally superimposed scale bar.

Key outcomes

In images without a scale bar, Moleanalyzer pro achieved a sensitivity of 87% (95% CI 67.9% to 95.5%), specificity of 87.9% (95% CI 80.3% to 92.8%) and ROC AUC of 95.3% (95% CI 91.4% to 99.2%). In images with a scale bar, no significant change was seen in sensitivity (range 87% to 95%, all p=1.0). However, specificity was reduced with 4 of the scale bar variants (range 0% to 43.9%, all p<0.001) and ROC AUC was reduced with 2 of the variants (range 52.0% to 84.8%, both p≤0.042).

Strengths and limitations

Strengths: the study demonstrates that clinically relevant artefacts affect accuracy, in this case a digitally superimposed scale bar (scale bars are used when imaged lesions are subsequently listed for excision or reviewed by a clinician). The study used a large sample size, which increases precision.

Limitations: the image sets used were limited in diagnoses (melanomas and naevi) and numbers (130 images per set), which may affect the overall generalisability of the results. The image library comes from a single centre outside the UK and is therefore less applicable to the UK. Fitzpatrick skin type and skin colour are not reported. Benign naevi did not have a histological ground truth.

Winkler et al. (2021b)

Intervention and comparator

Moleanalyzer pro compared with images of lesions presented for review to a collective human intelligence (CHI) of 120 dermatologists who were offered a choice of 6 different diagnoses.

Key outcomes

CHI achieved a significantly higher accuracy of 80% (95% CI 62.1% to 90.5%) compared with Moleanalyzer pro at 70% (95% CI 52.1% to 83.3%), p<0.001.

CHI achieved a higher sensitivity of 82.4% (95% CI 59.0% to 93.8%) and specificity of 76.9% (95% CI 49.7% to 91.8%) than Moleanalyzer pro's sensitivity of 70.6% (95% CI 46.9% to 86.7%) and specificity of 69.2% (95% CI 42.4% to 87.3%).

The diagnostic accuracy of CHI was superior to that of individual dermatologists (p<0.001) in multiclass evaluation, while the accuracy of the latter was comparable to multiclass Moleanalyzer pro.

Strengths and limitations

Strengths: the technology is compared with individual or collective dermatologists, which gives a good indication of its accuracy.

Limitations: the study used a relatively small sample size of 30 'difficult to diagnose' lesions, which may reduce its power and introduce bias. The results are applicable to a secondary care population only. Because of the limited test cases, a statistically significant difference could not be found for a number of assessments. The study does not define the sample's ethnic diversity.

Maier et al. (2015)

Intervention and comparator

SkinVision application's accuracy in detecting melanoma compared with clinical diagnosis and histological results as gold standard.

Key outcomes

Compared with histological results, the sensitivity of SkinVision was 73% (95% CI 52% to 88%), the specificity was 83% (95% CI 75% to 89%) and the accuracy was 81% (95% CI 74% to 87%). The positive predictive value for recognising melanoma was 49% (95% CI 32% to 65%) and the negative predictive value was 93% (95% CI 87% to 97%).
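The 95% confidence intervals reported throughout these studies are binomial proportion intervals whose width depends on the number of cases, which is why small melanoma counts give wide intervals. A sketch using the Wilson score method and hypothetical counts (not this study's data or necessarily its method):

```python
import math

# 95% Wilson score interval for a binomial proportion, such as a
# sensitivity estimated as k melanomas detected out of n melanomas.
def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical example: 19 of 26 melanomas detected.
lo, hi = wilson_ci(19, 26)
print(f"{19 / 26:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

With only a few dozen cases the interval spans tens of percentage points, illustrating why studies with very few melanomas cannot reliably estimate sensitivity.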

Strengths and limitations

Strengths: the study used a large sample of images, which increases precision.

Limitations: this is a single-centre study based outside the UK, which could reduce its generalisability to the NHS. The study was limited to pigmented lesions only; this is not applicable to NHS clinical practice, which does not have isolated clinics for pigmented lesions. The study included only 3 lesion types (there are more than 2,000 skin conditions) and, although prospective, used a selected population, so it is not generalisable to primary care. Also, 26% of lesions were excluded from analysis because of poor image quality.

As per the company's feedback, the study was done on an older version of the SkinVision service and may not be representative of its current performance.

Thissen et al. (2017)

Study size, design and location

A prospective study of 256 adults (with 341 lesions) done in a secondary care setting in the Netherlands, with 1 dermatologist and 1 trainee dermatologist. Images of patients' lesions were acquired by the dermatologist.

Intervention and comparator

SkinVision application's sensitivity and specificity in the diagnosis of melanoma and non-melanoma skin cancer along with actinic keratosis and Bowen's disease compared with the histopathology and clinical diagnosis of clearly benign lesions.

Key outcomes

Images of 233 of the 341 lesions were used to train the algorithms, and the remaining 108 lesions were used as test data.

High-risk lesions (n=44): sensitivity was 80% (95% CI 62% to 90%) and positive predictive value was 63% (95% CI 47% to 77%).

Low- to medium-risk lesions (n=64): specificity was 78% (95% CI 66% to 86%) and the negative predictive value was 89% (95% CI 78% to 95%).

Strengths and limitations

Strengths: the study used a large sample size, which increases precision.

Limitations: the study was not done in the UK and includes premalignant skin lesions, which is not applicable to NHS practice. SkinVision is designed as a patient-facing technology, although it can also be used in secondary care; this study does not reflect that intended use. Although a large sample of 341 images was used, most were used to train the algorithm and only 108 were used for testing, which might reduce the study's power; there was no reference to a power calculation. Additionally, only 4 melanomas were used to train or calibrate the algorithm and only 2 melanomas were in the test set, so this study cannot be used to infer diagnostic accuracy for melanoma. Furthermore, the high-risk category included premalignant skin lesions, which are not considered high risk in the UK and would not be referred on a 2‑week wait pathway to secondary care, further limiting generalisability to a UK NHS population. The study does not define skin type or include an ethnically diverse population. The images were acquired by the doctor, but the company proposes this technology as a patient-facing app, so the study does not demonstrate its proposed use in practice. In routine NHS clinical practice, GPs and patients frequently take images that are out of focus, which would affect sensitivity.

The company reports that the study was done on an older version of the SkinVision service algorithm and may not be representative of its current performance.

Sangers et al. (2022)

Intervention and comparator

SkinVision application's accuracy in detecting premalignancy and malignancy in skin lesions compared with histopathology and clinical diagnosis of clearly benign lesions.

Key outcomes

Overall sensitivity and specificity for the app were 86.9% (95% CI 82.3% to 90.7%) and 70.4% (95% CI 66.2% to 74.3%), respectively. In subgroup analysis, sensitivity was significantly higher for iOS devices than for Android devices (91% compared with 83%; p<0.001). Specificity calculated on benign control lesions was significantly higher than on suspicious skin lesions (80.1% compared with 45.5%; p<0.001).

Strengths and limitations

Strengths: this was a prospective multicentre study using a large sample size, which reduces bias.

Limitations: its cross-sectional design and the fact it was done outside the UK might limit its generalisability to NHS settings. The study does not define the sample's ethnic diversity: only 4 people with Fitzpatrick skin type 4 were included and no people with types 5 or 6. There were only 6 malignant melanomas and 6 in situ melanomas in the dataset, which is insufficient to reliably report melanoma detection accuracy.

Sustainability

These digital technologies have the potential to reduce carbon emissions from travel to and from appointments because of the projected reduction in onward referrals to secondary care. However, none of the 4 companies have provided any information about sustainability, and there is no published evidence that the technologies reduce footfall, or quantifying the effect a reduced footfall would have on carbon emissions.

Recent and ongoing studies