Keywords: artificial intelligence cancer classification crowdsourced crowdsourcing deep learning dermascopic dermatologist dermatology dermatoscopy dermoscopy development diagnosis diagnostic feasibility imaging labeling lesion machine learning medical image melanoma microscopy pigmentation skin

Source: DOI:10.2196/38412   PDF (PubMed)

Abstract:
BACKGROUND: Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts for identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images.
OBJECTIVE: The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures from images of pigmented lesions with similar reliability to a group of experts.
METHODS: First, we obtained labels of 248 images of melanocytic lesions with 31 dermoscopic "subfeatures" labeled by 20 dermoscopy experts. These were then collapsed into 6 dermoscopic "superfeatures" based on structural similarity, due to low interrater reliability (IRR): dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters.
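As a concrete illustration of the agreement metric, the sketch below (a hypothetical example, not code from the study) computes Cohen κ for pairs of raters' binary presence/absence labels and summarizes a rater pool by its median pairwise κ; the function name `median_pairwise_kappa` and the toy ratings array are assumptions made for illustration only.

```python
# Minimal sketch, not the study's code: Cohen's kappa for binary
# presence/absence labels, summarized as the median pairwise kappa
# across a pool of raters (the summary statistic reported in the abstract).
from itertools import combinations
from statistics import median

import numpy as np
from sklearn.metrics import cohen_kappa_score  # kappa = (p_o - p_e) / (1 - p_e)


def median_pairwise_kappa(ratings: np.ndarray) -> float:
    """ratings: (n_raters, n_images) array of 0/1 labels for one superfeature."""
    kappas = [
        cohen_kappa_score(ratings[i], ratings[j])
        for i, j in combinations(range(ratings.shape[0]), 2)
    ]
    return median(kappas)


# Hypothetical toy data: 3 raters labeling 6 images for one superfeature.
ratings = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1],
])
print(median_pairwise_kappa(ratings))
```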
RESULTS: In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. There was relatively lower agreement for the identification of dots and globules (the median κ values were 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (the median κ values were 0.581 and 0.798, respectively). This pattern was also seen among the expert raters, who had median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average-expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels.
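One plausible reading of the crowd-versus-expert comparison reported above is sketched below: an "average-expert" label is formed per image by thresholding the mean expert rating, and the median Cohen κ is then taken between each nonexpert rater and that consensus label. The 0.5 threshold, the median-over-nonexperts aggregation, and all variable names are assumptions for illustration, not details confirmed by the abstract.

```python
# Illustrative sketch under assumptions (not the study's code): build a
# "thresholded average-expert" label per image, then take the median
# Cohen's kappa between each nonexpert rater and that consensus label.
from statistics import median

import numpy as np
from sklearn.metrics import cohen_kappa_score


def thresholded_consensus(expert_ratings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """expert_ratings: (n_experts, n_images) 0/1 labels; returns one 0/1 consensus label per image."""
    return (expert_ratings.mean(axis=0) >= threshold).astype(int)


# Hypothetical toy data: 3 experts and 2 crowd raters on 6 images.
experts = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1],
])
crowd = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
])

consensus = thresholded_consensus(experts)  # -> [1, 0, 1, 1, 0, 1]
kappas = [cohen_kappa_score(consensus, rater) for rater in crowd]
print(median(kappas))
```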
CONCLUSIONS: This study confirmed that IRR for different dermoscopic features varied among a group of experts; a similar pattern was observed in a nonexpert crowd. There was good or excellent agreement for each of the 6 superfeatures between the crowd and the experts, highlighting the similar reliability of the crowd for labeling dermoscopic images. This confirms the feasibility and dependability of using crowdsourcing as a scalable solution to annotate large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.