OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and safe responses to patients' laboratory test-related questions and to identify potential issues that could be mitigated with augmentation approaches.
METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics: Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation With Explicit Ordering (METEOR), and Bidirectional Encoder Representations from Transformers Score (BERTScore). We used an LLM-based evaluator to judge whether a target model produced higher-quality responses than a baseline model in terms of relevance, correctness, helpfulness, and safety. Finally, medical experts manually evaluated all responses to 7 selected questions on the same 4 aspects.
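As a concrete illustration of the similarity-scoring step, the sketch below computes the 4 reported metrics for one candidate answer against a GPT-4 reference using the Hugging Face `evaluate` library; the package choice and the example texts are assumptions for illustration, not the study's exact pipeline.

```python
# Minimal sketch: score one candidate answer against a GPT-4 reference answer
# with ROUGE, BLEU, METEOR, and BERTScore (library choice is an assumption).
import evaluate

# Hypothetical example texts; in the study, the GPT-4 output served as the reference.
reference = ["Your TSH is slightly above the typical range; discuss the result with your clinician."]
candidate = ["A mildly elevated TSH can indicate hypothyroidism; please consult your doctor."]

rouge = evaluate.load("rouge")          # Recall-Oriented Understudy for Gisting Evaluation
bleu = evaluate.load("bleu")            # Bilingual Evaluation Understudy
meteor = evaluate.load("meteor")        # Metric for Evaluation of Translation With Explicit Ordering
bertscore = evaluate.load("bertscore")  # embedding-based similarity (BERTScore)

print(rouge.compute(predictions=candidate, references=reference))
print(bleu.compute(predictions=candidate, references=reference))
print(meteor.compute(predictions=candidate, references=reference))
print(bertscore.compute(predictions=candidate, references=reference, lang="en"))
```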
RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from the Yahoo! Answers data scored the lowest and were, thus, the least similar to the GPT-4-generated answers. Both the win-rate and the medical expert evaluations showed that GPT-4's responses achieved better scores than all the other LLM responses and the human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from a lack of interpretation within the patient's specific medical context, incorrect statements, and a lack of references.
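A minimal sketch of how an LLM-based pairwise judgment, and the win rate derived from it, can be computed is shown below; the judge prompt, the `judge` helper, and the use of the OpenAI Python client are illustrative assumptions rather than the study's exact rubric.

```python
# Minimal sketch of LLM-based pairwise quality judging (assumes an OpenAI API key).
from openai import OpenAI

client = OpenAI()

def judge(question: str, baseline: str, target: str) -> str:
    """Ask a judge model whether the target answer beats the baseline answer."""
    prompt = (
        "You are judging two answers to a patient's laboratory test question "
        "on relevance, correctness, helpfulness, and safety.\n\n"
        f"Question: {question}\n\n"
        f"Answer A (baseline): {baseline}\n\n"
        f"Answer B (target): {target}\n\n"
        "Reply with exactly one word: A, B, or Tie."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# The target model's win rate = (number of "B" verdicts) / (number of comparisons)
# aggregated over all 53 questions.
```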
CONCLUSIONS: By evaluating LLMs' responses to patients' laboratory test result-related questions, we found that, compared with the other 4 LLMs and the human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safe. However, there were cases in which GPT-4's responses were inaccurate and not individualized. We identified several ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
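As one example of the mitigation approaches named above, the sketch below shows retrieval-augmented generation with LangChain (the framework the study used for response generation): vetted lab-test reference text is indexed, and the top matches are injected into the prompt. The corpus (`lab_reference_docs`), the prompt wording, and the model choice are assumptions for illustration, not the study's implementation.

```python
# Minimal retrieval-augmented generation sketch (assumes langchain-openai,
# langchain-community, and faiss-cpu are installed and an OpenAI API key is set).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Hypothetical snippets of vetted laboratory reference material.
lab_reference_docs = [
    "TSH (thyroid-stimulating hormone) reference range: about 0.4-4.0 mIU/L in adults.",
    "HbA1c below 5.7% is considered normal; 5.7%-6.4% indicates prediabetes.",
]

# Index the reference snippets so the most relevant ones can be retrieved per question.
store = FAISS.from_texts(lab_reference_docs, OpenAIEmbeddings())

def answer_with_context(question: str) -> str:
    # Retrieve the top-matching reference snippets and ground the prompt in them.
    context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))
    prompt = (
        "Using only the reference information below, answer the patient's question "
        "about their laboratory test result, and advise consulting a clinician.\n\n"
        f"References:\n{context}\n\nQuestion: {question}"
    )
    return ChatOpenAI(model="gpt-4").invoke(prompt).content

print(answer_with_context("My TSH came back at 5.2. Should I be worried?"))
```

Grounding the prompt in retrieved reference text addresses two of the issues observed in the results: incorrect statements and the lack of references.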