Keywords: Bard; ChatGPT; GenAI; digital health; generative artificial intelligence; qualitative research; thematic analysis

Source: DOI:10.2196/54482 (PubMed)

Abstract:
BACKGROUND: Qualitative methods are incredibly beneficial to the dissemination and implementation of new digital health interventions; however, these methods can be time intensive and slow down dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and their underlying large language models (LLMs) may provide a promising opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown.
OBJECTIVE: The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (ie, ChatGPT and Bard) and human coders.
METHODS: The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these SMS text messages were conducted by 2 independent teams of human coders. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency in themes (or the extent to which the themes were the same) and reliability (or agreement in coding of themes) between methods were compared.
RESULTS: The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT: 6/12, 50%; Bard: 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive: 31/66, 47%; ChatGPT, deductive: 22/59, 37%; Bard, inductive: 20/54, 37%; Bard, deductive: 21/58, 36%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive: 6/6, 100%; deductive: 5/6, 83%) and reliability of coding (inductive: 23/62, 37%; deductive: 22/47, 47%). GenAI required significantly less overall time than human coders to conduct the qualitative analysis (mean 20, SD 3.5 min vs mean 567, SD 106.5 min).
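The percentage-agreement figures above are simple ratios, so they can be rechecked directly from the reported counts. The following is a minimal sketch, not the authors' analysis code: the names `REPORTED`, `percent`, and `agreement_table` are hypothetical, and rounding to a whole-number percentage is an assumption inferred from how the abstract presents its values.

```python
# Counts (agreeing codes, total codes) copied from the RESULTS paragraph.
# The dictionary layout and all names here are illustrative assumptions.
REPORTED = {
    "ChatGPT, inductive": (31, 66),
    "ChatGPT, deductive": (22, 59),
    "Bard, inductive": (20, 54),
    "Bard, deductive": (21, 58),
}


def percent(numerator: int, denominator: int) -> int:
    """Return numerator/denominator as a whole-number percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * numerator / denominator)


def agreement_table(counts: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Map each comparison label to its percentage agreement."""
    return {label: percent(n, d) for label, (n, d) in counts.items()}


if __name__ == "__main__":
    for label, value in agreement_table(REPORTED).items():
        print(f"{label}: {value}%")
```

The same `percent` helper applies to the theme-consistency ratios (for example, 5/7 for the inductive analysis).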
CONCLUSIONS: The consistency in the themes generated by human coders and GenAI suggests that these technologies hold promise in reducing the resource intensiveness of qualitative thematic analysis; however, the relatively lower reliability in coding between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these powerful technologies can best be used in collaboration with human coders to improve the efficiency of qualitative research in hybrid approaches while also mitigating the potential ethical risks that they may pose.