UNASSIGNED: Even though patients have easy access to their electronic health records and lab test results through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question-and-answer (Q&A) sites to seek advice from their peers. However, the quality of answers to health-related questions on social Q&A sites varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered.
UNASSIGNED: We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and safe responses to lab test-related questions asked by patients, and to identify potential issues that can be mitigated with augmentation approaches.
UNASSIGNED: We first collected lab test result-related question-and-answer data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We first assessed the similarity of their answers using standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also utilized an LLM-based evaluator to judge whether a target model's responses are of higher quality than the baseline model's in terms of relevance, correctness, helpfulness, and safety. Finally, we performed a manual evaluation with medical experts of all responses to seven selected questions on the same four aspects.
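The similarity metrics named above (ROUGE, BLEU, METEOR, BERTScore) are normally computed with their reference library implementations. As a simplified illustration of how the token-overlap family of these metrics works, the following pure-Python sketch computes a ROUGE-1-style unigram F1 between a candidate answer and a reference answer; the function name, tokenization, and example strings are illustrative, not the evaluation code used in the study:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style unigram F1: clipped token overlap between answers."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matches, clipped per token type
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: score a model answer against a reference answer (illustrative text)
score = rouge1_f1("your TSH level is slightly elevated",
                  "the TSH level is elevated")
```

In the study's setup, the GPT-4 output served as the reference string, and each other model's (or human's) answer as the candidate; BERTScore replaces exact token matching with embedding similarity, which such a sketch does not capture.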
UNASSIGNED: Regarding the similarity of the responses from the four LLMs, with the GPT-4 output used as the reference answer, the responses from LLaMA 2 were the most similar, followed by ORCA_mini and MedAlpaca. Human answers from the Yahoo data scored lowest and were thus least similar to the GPT-4-generated answers. The Win Rate and medical expert evaluations both showed that GPT-4's responses achieved better scores than all the other LLM responses and the human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffer from a lack of interpretation of one's medical context, incorrect statements, and a lack of references.
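The Win Rate evaluation aggregates pairwise verdicts from the LLM-based judge. Assuming each comparison yields a "win", "tie", or "loss" for the target model against the baseline, a win rate can be computed as the fraction of comparisons won; this definition and the function below are an illustrative sketch, not the study's exact scoring code:

```python
def win_rate(verdicts: list[str]) -> float:
    """Fraction of pairwise comparisons the target model wins.

    `verdicts` holds one of "win", "tie", or "loss" per question,
    as judged by an LLM evaluator comparing target vs. baseline answers.
    """
    if not verdicts:
        return 0.0
    return verdicts.count("win") / len(verdicts)

# Example: target model judged better on 3 of 4 questions
rate = win_rate(["win", "win", "loss", "win"])  # 0.75
```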
UNASSIGNED: By evaluating LLMs' responses to patients' lab test result-related questions, we find that, compared with the other three LLMs and the human answers from the Q&A website, GPT-4's responses are more accurate, helpful, relevant, and safer. However, there are cases where GPT-4's responses are inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.