Keywords: ChatGPT; LLM; LLMs; artificial intelligence; deep learning; derm; dermatologic patient education material; dermatologic patient education materials; dermatologist; dermatologists; dermatology; dermatology resident; dermatology residents; education material; education materials; health information; health knowledge; health literacy; large language model; large language models; machine learning; natural language processing; patient education; patient education material; patient education materials

MeSH: Humans; Skin Diseases; Patient Education as Topic/methods; Dermatology/education; Reading; Qualitative Research; Language; Health Literacy; Teaching Materials

Source: DOI:10.2196/55898   PDF (PubMed)

Abstract:
BACKGROUND: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels.
OBJECTIVE: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees.
METHODS: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified, fifth-, and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees.
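For readers who want to approximate this workflow programmatically, the sketch below pairs the prompt template quoted in the Methods with a standard Flesch-Kincaid grade-level calculation. It is an illustrative assumption, not the authors' setup: the study used each model's own interface and Microsoft Word's readability statistics, whereas this example calls the OpenAI Python client (the "gpt-4" model name and the vowel-group syllable heuristic are placeholders) and does not cover DermGPT or DocsGPT.

import re
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Prompt template quoted from the Methods section of the abstract.
PROMPT = "Create a patient education handout about {condition} at a {fkrl}"

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; Microsoft Word uses its own syllable counter.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    # Flesch-Kincaid grade level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

def generate_pem(condition: str, fkrl: str, model: str = "gpt-4") -> str:
    # One handout per call; the study generated 10 PEMs per condition and prompt type.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(condition=condition, fkrl=fkrl)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for condition in ["atopic dermatitis", "bullous pemphigoid"]:
        handout = generate_pem(condition, "fifth-grade reading level")
        print(condition, round(fk_grade_level(handout), 2))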
RESULTS: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found among ChatGPT-3.5, GPT-4, DocsGPT, and DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation-of-meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%).
CONCLUSIONS: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology.