OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and safe responses to patients' laboratory test-related questions and to identify potential issues that could be mitigated with augmentation approaches.
METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics: Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation With Explicit Ordering (METEOR), and Bidirectional Encoder Representations from Transformers Score (BERTScore). We used an LLM-based evaluator to judge whether a target model produced higher-quality responses than a baseline model in terms of relevance, correctness, helpfulness, and safety. Finally, medical experts manually evaluated all responses to 7 selected questions on the same 4 aspects.
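As a concrete illustration of the similarity-scoring step, the sketch below computes the 4 reported metrics for one candidate answer against a GPT-4 reference using the Hugging Face `evaluate` library; the package choice and the example texts are assumptions for illustration, not the study's exact pipeline.

```python
# Minimal sketch: score one candidate answer against a GPT-4 reference answer
# with ROUGE, BLEU, METEOR, and BERTScore (library choice is an assumption).
import evaluate

# Hypothetical example texts; in the study, the GPT-4 output served as the reference.
reference = ["Your TSH is slightly above the typical range; discuss the result with your clinician."]
candidate = ["A mildly elevated TSH can indicate hypothyroidism; please consult your doctor."]

rouge = evaluate.load("rouge")          # Recall-Oriented Understudy for Gisting Evaluation
bleu = evaluate.load("bleu")            # Bilingual Evaluation Understudy
meteor = evaluate.load("meteor")        # Metric for Evaluation of Translation With Explicit Ordering
bertscore = evaluate.load("bertscore")  # embedding-based similarity (BERTScore)

print(rouge.compute(predictions=candidate, references=reference))
print(bleu.compute(predictions=candidate, references=reference))
print(meteor.compute(predictions=candidate, references=reference))
print(bertscore.compute(predictions=candidate, references=reference, lang="en"))
```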
RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from the Yahoo! Answers data scored the lowest and were, thus, the least similar to the GPT-4-generated answers. Both the win-rate and the medical expert evaluations showed that GPT-4's responses achieved better scores than all the other LLM responses and the human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from a lack of interpretation within the patient's specific medical context, incorrect statements, and a lack of references.
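A minimal sketch of how an LLM-based pairwise judgment, and the win rate derived from it, can be computed is shown below; the judge prompt, the `judge` helper, and the use of the OpenAI Python client are illustrative assumptions rather than the study's exact rubric.

```python
# Minimal sketch of LLM-based pairwise quality judging (assumes an OpenAI API key).
from openai import OpenAI

client = OpenAI()

def judge(question: str, baseline: str, target: str) -> str:
    """Ask a judge model whether the target answer beats the baseline answer."""
    prompt = (
        "You are judging two answers to a patient's laboratory test question "
        "on relevance, correctness, helpfulness, and safety.\n\n"
        f"Question: {question}\n\n"
        f"Answer A (baseline): {baseline}\n\n"
        f"Answer B (target): {target}\n\n"
        "Reply with exactly one word: A, B, or Tie."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# The target model's win rate = (number of "B" verdicts) / (number of comparisons)
# aggregated over all 53 questions.
```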
CONCLUSIONS: By evaluating LLMs' responses to patients' laboratory test result-related questions, we found that, compared with the other 4 LLMs and the human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safe. However, there were cases in which GPT-4's responses were inaccurate and not individualized. We identified several ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
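As one example of the mitigation approaches named above, the sketch below shows retrieval-augmented generation with LangChain (the framework the study used for response generation): vetted lab-test reference text is indexed, and the top matches are injected into the prompt. The corpus (`lab_reference_docs`), the prompt wording, and the model choice are assumptions for illustration, not the study's implementation.

```python
# Minimal retrieval-augmented generation sketch (assumes langchain-openai,
# langchain-community, and faiss-cpu are installed and an OpenAI API key is set).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Hypothetical snippets of vetted laboratory reference material.
lab_reference_docs = [
    "TSH (thyroid-stimulating hormone) reference range: about 0.4-4.0 mIU/L in adults.",
    "HbA1c below 5.7% is considered normal; 5.7%-6.4% indicates prediabetes.",
]

# Index the reference snippets so the most relevant ones can be retrieved per question.
store = FAISS.from_texts(lab_reference_docs, OpenAIEmbeddings())

def answer_with_context(question: str) -> str:
    # Retrieve the top-matching reference snippets and ground the prompt in them.
    context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))
    prompt = (
        "Using only the reference information below, answer the patient's question "
        "about their laboratory test result, and advise consulting a clinician.\n\n"
        f"References:\n{context}\n\nQuestion: {question}"
    )
    return ChatOpenAI(model="gpt-4").invoke(prompt).content

print(answer_with_context("My TSH came back at 5.2. Should I be worried?"))
```

Grounding the prompt in retrieved reference text addresses two of the issues observed in the results: incorrect statements and the lack of references.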