Keywords: B-lines; POCUS; artificial intelligence; classification; crowdsource; crowdsourced; crowdsourcing; data science; diagnose; diagnosis; diagnostic; gamification; gamified; gamify; imaging; label; labeling; labels; lung; lung ultrasound; machine learning; medical image; point-of-care ultrasound; pulmonary; respiratory; ultrasound

MeSH: Crowdsourcing / methods; Humans; Ultrasonography / methods / standards; Lung / diagnostic imaging; Prospective Studies; Female; Male; Machine Learning; Adult; Middle Aged; Retrospective Studies

Source: DOI:10.2196/51397

Abstract:
BACKGROUND: Machine learning (ML) models can yield faster and more accurate medical diagnoses; however, developing ML models is limited by a lack of high-quality labeled training data. Crowdsourced labeling is a potential solution but can be constrained by concerns about label quality.
OBJECTIVE: This study aims to examine whether a gamified crowdsourcing platform with continuous performance assessment, user feedback, and performance-based incentives could produce expert-quality labels on medical imaging data.
METHODS: In this diagnostic comparison study, 2384 lung ultrasound clips were retrospectively collected from 203 emergency department patients. A total of 6 lung ultrasound experts classified 393 of these clips as having no B-lines, one or more discrete B-lines, or confluent B-lines to create 2 reference standard data sets (195 training clips and 198 test clips). These sets were used, respectively, to (1) train users on a gamified crowdsourcing platform and (2) compare the concordance of the resulting crowd labels with the concordance of individual experts against the reference standard. Crowd opinions were sourced from DiagnosUs (Centaur Labs) iOS app users over 8 days, filtered based on past performance, aggregated using majority rule, and analyzed for label concordance against a hold-out test set of expert-labeled clips. The primary outcome was the labeling concordance of the aggregated crowd opinions compared with that of trained experts in classifying B-lines on lung ultrasound clips.
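As a minimal sketch of the aggregation step described above (not the study's actual code), the following Python illustrates quality filtering by past user performance, majority-rule label aggregation per clip, and concordance against the expert reference standard; the function names, data structures, and accuracy threshold are illustrative assumptions.

```python
from collections import Counter

# The three B-line classes used in the study (names are illustrative).
CLASSES = ("no_b_lines", "discrete_b_lines", "confluent_b_lines")

def aggregate_crowd_labels(opinions, user_accuracy, min_accuracy=0.8):
    """opinions: iterable of (clip_id, user_id, label); user_accuracy: dict user_id -> float."""
    votes = {}
    for clip_id, user_id, label in opinions:
        # Quality filter: keep opinions only from users above an accuracy threshold.
        if user_accuracy.get(user_id, 0.0) >= min_accuracy:
            votes.setdefault(clip_id, []).append(label)
    # Majority rule: the most common label per clip (ties broken by first-seen order).
    return {clip: Counter(v).most_common(1)[0][0] for clip, v in votes.items()}

def concordance(crowd_labels, expert_labels):
    """Fraction of expert-labeled clips where the crowd label matches the reference."""
    shared = [c for c in expert_labels if c in crowd_labels]
    agree = sum(crowd_labels[c] == expert_labels[c] for c in shared)
    return agree / len(shared) if shared else float("nan")
```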
RESULTS: Our clinical data set included patients with a mean age of 60.0 (SD 19.0) years; 105 (51.7%) patients were female and 114 (56.1%) patients were White. Over the 195 training clips, the expert-consensus label distribution was 114 (58%) no B-lines, 56 (29%) discrete B-lines, and 25 (13%) confluent B-lines. Over the 198 test clips, the expert-consensus label distribution was 138 (70%) no B-lines, 36 (18%) discrete B-lines, and 24 (12%) confluent B-lines. In total, 99,238 opinions were collected from 426 unique users. On a test set of 198 clips, the mean labeling concordance of individual experts relative to the reference standard was 85.0% (SE 2.0), compared with 87.9% crowdsourced label concordance (P=.15). When individual experts' opinions were compared with reference standard labels created by majority vote excluding their own opinion, crowd concordance was higher than the mean concordance of individual experts with those reference standards (87.4% vs 80.8%, SE 1.6 for expert concordance; P<.001). Clips with discrete B-lines generated the most disagreement with the expert consensus, both for the crowd consensus and for individual experts. Using randomly sampled subsets of crowd opinions, 7 quality-filtered opinions were sufficient to approach the maximum crowd concordance.
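A hedged sketch of the opinion-subsampling analysis mentioned above (illustrative only, not the authors' implementation): for each candidate number of opinions k, randomly sample k quality-filtered opinions per clip, take the majority label, and measure concordance with the expert reference; sweeping k shows where concordance plateaus (around 7 opinions in the study). All names below are assumptions.

```python
import random
from collections import Counter

def subsampled_concordance(clip_opinions, expert_labels, k, n_repeats=100, seed=0):
    """clip_opinions: dict clip_id -> list of quality-filtered crowd labels."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        agree = total = 0
        for clip, labels in clip_opinions.items():
            if clip not in expert_labels or len(labels) < k:
                continue  # skip clips without enough opinions to sample
            sample = rng.sample(labels, k)                   # draw k opinions at random
            majority = Counter(sample).most_common(1)[0][0]  # majority-rule label
            agree += majority == expert_labels[clip]
            total += 1
        if total:
            estimates.append(agree / total)
    return sum(estimates) / len(estimates) if estimates else float("nan")

# Example sweep over k to find where concordance plateaus:
# for k in range(1, 16):
#     print(k, subsampled_concordance(clip_opinions, expert_labels, k))
```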
CONCLUSIONS: Crowdsourced labels for B-line classification on lung ultrasound clips via a gamified approach achieved expert-level accuracy. This suggests a strategic role for gamified crowdsourcing in efficiently generating labeled image data sets for training ML systems.