Large language model

  • Article type: Letter
    This letter evaluates the article by Gravina et al on ChatGPT's potential to provide medical information to patients with inflammatory bowel disease. While the findings are promising, the letter highlights the need for advanced techniques such as reasoning + action and retrieval-augmented generation to improve accuracy and reliability. Emphasizing that simple question-and-answer testing is insufficient, it calls for more nuanced evaluation methods to truly gauge large language models' capabilities in clinical applications.
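
    As a concrete illustration of the techniques the letter calls for, the sketch below shows a minimal reasoning + action loop grounded in retrieval, written in Python. `call_llm` and `search_guidelines` are hypothetical stubs, not any published system; a real deployment would wrap an actual LLM API and a vetted clinical corpus.

```python
# Minimal reasoning + action (ReAct-style) loop with a retrieval step.
# Both helpers are stubs: replace them with a real LLM client and a
# search over curated IBD guidance before any actual use.

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; replace with a real API client."""
    raise NotImplementedError

def search_guidelines(query: str) -> str:
    """Stub retriever over a vetted clinical corpus."""
    raise NotImplementedError

def react_answer(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model reasons, then emits either an action or a final answer.
        step = call_llm(transcript + "Thought + Action (SEARCH[...] or FINISH[...]):")
        if step.startswith("FINISH["):
            return step[len("FINISH["):-1]
        if step.startswith("SEARCH["):
            evidence = search_guidelines(step[len("SEARCH["):-1])
            transcript += f"{step}\nObservation: {evidence}\n"
    return call_llm(transcript + "Final answer:")
```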

  • Article type: Journal Article
    BACKGROUND: Chatbots, which are based on large language models, are increasingly being used in public health. However, the effectiveness of chatbot responses has been debated, and their performance in myopia prevention and control has not been fully explored. This study aimed to evaluate the effectiveness of three well-known chatbots (ChatGPT, Claude, and Bard) in responding to public health questions about myopia.
    METHODS: Nineteen public health questions about myopia (covering three topics: policy, basics, and measures) were answered individually by the three chatbots. After the order was shuffled, each chatbot response was independently rated by 4 raters for comprehensiveness, accuracy, and relevance.
    RESULTS: The study's questions underwent reliability testing. There was a significant difference in response word count among the 3 chatbots; from most to least, the order was ChatGPT, Bard, and Claude. All 3 chatbots had composite scores above 4 out of 5, and ChatGPT scored the highest in all aspects of the assessment. However, all chatbots exhibited shortcomings, such as giving fabricated responses.
    CONCLUSIONS: Chatbots have shown great potential in public health, with ChatGPT performing best. The future use of chatbots as a public health tool will require the rapid development of standards for their use and monitoring, as well as continued research, evaluation, and improvement of chatbots.
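
    For readers who want to reproduce this style of evaluation, here is a minimal sketch of the word-count comparison and composite scoring, with made-up numbers standing in for the study's data; the paper does not name its statistical test, so the Kruskal-Wallis test is used here as one reasonable choice.

```python
# Illustrative analysis skeleton: word counts per chatbot and 1-5 ratings
# from 4 raters over 19 questions. All numbers are placeholders.
import numpy as np
from scipy import stats

word_counts = {
    "ChatGPT": np.array([410, 395, 450, 388]),
    "Bard":    np.array([300, 280, 310, 295]),
    "Claude":  np.array([210, 190, 205, 220]),
}
h, p = stats.kruskal(*word_counts.values())
print(f"Word-count difference: H={h:.2f}, p={p:.3f}")

rng = np.random.default_rng(0)
# ratings[bot] has shape (n_questions, n_raters); composite = overall mean.
ratings = {bot: rng.uniform(3.5, 5.0, size=(19, 4)) for bot in word_counts}
for bot, r in ratings.items():
    print(bot, "composite score:", round(float(r.mean()), 2))
```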

  • Article type: Journal Article
    The definition of service has evolved from a focus on material value in manufacturing before the 2000s to a customer-centric value based on the significant growth of the service industry. Digital transformation has become essential for companies in the service industry due to the incorporation of digital technology through the Fourth Industrial Revolution and COVID-19. This study utilised Bidirectional Encoder Representations from Transformers (BERT) to analyse 3029 international patents related to the customer service industry and digital transformation registered between 2000 and 2022. Through topic modelling, this study identified 10 major topics in the customer service industry and analysed their yearly trends. Our findings show that as of 2022, the trend with the highest frequency is user-centric network service design, while cloud computing has experienced the steepest increase in the last five years. User-centric network services have been steadily developing since the inception of the Internet, and cloud computing is one of the key technologies being developed intensively in 2023 for the digital transformation of customer service. This study identifies time-series trends in customer service industry patents and suggests the effectiveness of using BERTopic to predict future trends in technology.
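
    As a pointer for readers unfamiliar with the tooling, below is a minimal sketch of a BERTopic workflow consistent with the description above. The study's exact configuration (embedding model, preprocessing, topic-count handling) is not given, and a realistically sized corpus is needed for the model to fit.

```python
from bertopic import BERTopic

# `abstracts` and `years` stand in for the 3029 patent texts and their
# registration years; a corpus this small would not actually fit.
abstracts = ["cloud-based customer support platform ...",
             "user-centric network service design ..."]
years = [2005, 2021]

topic_model = BERTopic(language="english", nr_topics=10)
topics, probs = topic_model.fit_transform(abstracts)

# Yearly topic trends, analogous to the study's time-series analysis.
topics_over_time = topic_model.topics_over_time(abstracts, years)
print(topic_model.get_topic_info().head(10))
```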

  • Article type: Journal Article
    BACKGROUND: The consumer availability and automated response functions of Chat Generative Pretrained Transformer (ChatGPT-4), a large language model, position this application to be used for patient health queries, and it may have a role as an adjunct to minimize administrative and clinical burden.
    PURPOSE: To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and to compare these results with the performance of Google.
    STUDY DESIGN: Cross-sectional study.
    METHODS: Google Web Search was used as a benchmark because it is the most widely used search engine worldwide and the only search engine that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query "ulnar collateral ligament reconstruction" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of answers for comparison. This process was replicated to obtain 10 new questions requiring numeric instead of open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A., J.S.D.) blinded to the search engine and answer source.
    RESULTS: ChatGPT-4 used a greater proportion of academic sources than Google to provide answers to the top 10 FAQs, although the difference was not statistically significant (90% vs 50%; P = .14). In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. When comparing FAQs with numeric responses, 20% of answers overlapped completely, 30% demonstrated partial overlap, and the remaining 50% did not overlap at all. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (± standard deviation) accuracy of answers given by ChatGPT-4 was significantly greater than that of Google for both the top 10 FAQs (1.9 ± 0.2 vs 1.2 ± 0.6; P = .001) and the top 10 questions with numeric answers (1.8 ± 0.4 vs 1 ± 0.8; P = .013).
    CONCLUSIONS: ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. Compared with Google Web Search, ChatGPT-4 used a greater proportion of academic websites to respond to FAQs representative of patient inquiries and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct when answering queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.
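
    A minimal sketch of the two kinds of comparison reported in the Results, with illustrative counts and grades. The paper does not name its statistical tests; Fisher's exact test for source proportions and an independent-samples t test for accuracy grades are plausible stand-ins for samples this small.

```python
# Illustrative stand-ins for the paper's data, not the actual values.
import numpy as np
from scipy import stats

# Academic vs non-academic sources among 10 FAQ answers per engine.
# Rows: ChatGPT-4, Google; columns: academic, non-academic.
table = np.array([[9, 1], [5, 5]])
odds, p_sources = stats.fisher_exact(table)
print(f"Source comparison: p = {p_sources:.2f}")

# Accuracy grades (0-2) for the top 10 FAQs, one value per question.
chatgpt_grades = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1])
google_grades  = np.array([2, 1, 1, 2, 0, 1, 1, 2, 1, 1])
t, p_acc = stats.ttest_ind(chatgpt_grades, google_grades)
print(f"Accuracy comparison: t = {t:.2f}, p = {p_acc:.3f}")
```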

  • Article type: Journal Article
    BACKGROUND: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback on the quality of their free-text clinical notes.
    OBJECTIVE: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes.
    METHODS: This was a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and by ChatGPT using a prespecified scoring rubric consisting of 85 case elements. The measure of accuracy was percent correct.
    RESULTS: The study population consisted of 168 first-year medical students, yielding a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%; the ChatGPT error rate was thus 86% lower than the standardized patient error rate. The ChatGPT mean number of incorrect scores, 12 (SD 11), was significantly lower than that of the standardized patients, 85 (SD 74; P=.002).
    CONCLUSIONS: ChatGPT demonstrated a significantly lower error rate compared with standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians on their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
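
    A minimal sketch of rubric-based note scoring through the OpenAI Python client. The prompt wording, model name, and rubric items here are illustrative assumptions; the paper does not publish its prompt or scoring interface.

```python
# Sketch: mark each rubric case element PRESENT or ABSENT in a student note.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_note(note: str, rubric_items: list[str]) -> list[str]:
    """Ask the model to judge each rubric element against the note."""
    results = []
    for item in rubric_items:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed model choice for illustration
            messages=[
                {"role": "system",
                 "content": "You grade medical student notes. Answer PRESENT or ABSENT only."},
                {"role": "user",
                 "content": f"Case element: {item}\n\nStudent note:\n{note}"},
            ],
        )
        results.append(resp.choices[0].message.content.strip())
    return results

# With 85 such elements, 168 students yield 168 x 85 = 14,280 scores.
```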

  • Article type: Journal Article
    PURPOSE: To evaluate the performance of four large language models (LLMs), GPT-4, PaLM 2, Qwen, and Baichuan 2, in generating responses to inquiries from Chinese patients about dry eye disease (DED).
    DESIGN: Two-phase study, comprising a cross-sectional test in the first phase and a real-world clinical assessment in the second phase.
    PARTICIPANTS: Eight board-certified ophthalmologists and 46 patients with DED.
    METHODS: The chatbots' responses to Chinese patients' inquiries about DED were assessed. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses using a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability analysis was performed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED posed their questions to the two language models that had performed best in the first phase (GPT-4 and Baichuan 2) and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed the responses across the five domains.
    MAIN OUTCOME MEASURES: Subjective scores for the five domains and objective readability scores in the first phase; patient satisfaction, readability scores, and subjective scores for the five domains in the second phase.
    RESULTS: In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47; p < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (p < 0.05) compared with scores of 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept at answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, p < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, p < 0.05; ophthalmologist readability: 2.67 vs. 4.33).
    CONCLUSIONS: These findings underscore the potential of LLMs, particularly GPT-4 and Baichuan 2, to deliver accurate and comprehensive responses to questions from Chinese patients about DED.

  • Article type: Journal Article
    Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and have attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlations from image-text pairs, such as BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs to image quality assessment (IQA), particularly in medical imaging, remains unexplored; such an application would be valuable for objective performance evaluation and could potentially supplement or even replace radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels was professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions; the captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. The results demonstrate the feasibility of assessing image quality with LLMs: the proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
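
    The abstract states that the captioning model fuses image and text features through cross-modal attention. Below is a minimal PyTorch sketch of that general mechanism; the dimensions, residual layout, and module structure are assumptions, not the IQAGPT architecture.

```python
# Cross-modal attention fusion: text tokens attend over image patch features.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats: (B, T, D) queries; image_feats: (B, P, D) keys/values.
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        return self.norm(text_feats + fused)  # residual connection

fusion = CrossModalFusion()
text = torch.randn(2, 16, 768)    # e.g. caption token embeddings
image = torch.randn(2, 196, 768)  # e.g. 14x14 CT patch embeddings
print(fusion(text, image).shape)  # torch.Size([2, 16, 768])
```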

  • Article type: Journal Article
    OBJECTIVE: Chat Generative Pretrained Transformer (ChatGPT) is a large language model developed by OpenAI that has gained widespread interest. It has been cited for its potential impact on health care and its beneficial role in medical education. However, there has been limited investigation into its use among medical students. In this study, we evaluated the frequency of ChatGPT use, motivations for use, and preference for ChatGPT over existing resources among medical students in the United States.
    METHODS: Data were collected from an original survey consisting of 14 questions assessing the frequency and usage of ChatGPT in various contexts within medical education. The survey was distributed via email lists, group messaging applications, and classroom lectures to medical students across the United States. Responses were collected between August and October 2023.
    RESULTS: One hundred thirty-one participants completed the survey and were included in the analysis. Of the total, 48.9% of respondents reported that they had used ChatGPT in their medical studies. Among ChatGPT users, 43.7% reported using ChatGPT weekly, several times per week, or daily. ChatGPT was most often used for writing, revising, editing, and summarizing purposes, with 37.5% and 41.3% of users, respectively, reporting that ChatGPT accounted for more than 25% of the time spent on these tasks. Among respondents who had not used ChatGPT, more than 50% reported that they were extremely unlikely or unlikely to use ChatGPT across all surveyed scenarios. ChatGPT users were more likely to turn to ChatGPT over directly asking professors or attendings (45.3%), textbooks (42.2%), and lectures (31.7%), and least likely to prefer it over the popular flashcard application Anki (11.1%) and medical education videos (9.5%).
    CONCLUSIONS: ChatGPT is an increasingly popular resource among medical students, with many preferring it over traditional resources such as professors, textbooks, and lectures. Its impact on medical education will only continue to grow as its capabilities improve.

  • Article type: Journal Article
    BACKGROUND: In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training; thus, there is an urgent need to enhance their capacity to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach with the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of "hallucination," whereby the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers.
    OBJECTIVE: This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model's performance against that of a large FM.
    METHODS: We developed a CaLM using the retrieval-augmented generation (RAG) framework combined with FM fine-tuning, improving the quality of FM answers by grounding the model in a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. We used 2 small FMs as candidate foundations for the CaLM (LLaMA [large language model Meta AI] 2 and Falcon, each with 7 billion parameters) and adopted a large FM (GPT-3.5, with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet, focusing on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models' performance using benchmark metrics commonly used in evaluating language models, as well as their reliability in providing accurate references with their answers.
    RESULTS: The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT-3.5 across all metrics. The fine-tuned LLaMA 2 small FM also performed better than GPT-3.5 (even with RAG) in returning references with its answers.
    CONCLUSIONS: The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain.
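
    A minimal sketch of the RAG pattern the Methods describe: embed the caregiving knowledge base, retrieve the passages closest to a question, and ground the fine-tuned FM's answer in them. The embedding model and the `generate` stub are assumptions, not the study's exact stack.

```python
# RAG skeleton: dense retrieval over a small caregiving corpus.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
knowledge_base = [
    "Establish a daily routine for a person with dementia ...",
    "Managing wandering behavior safely ...",
]  # in the study, documents gathered from the internet
kb_embeddings = encoder.encode(knowledge_base, convert_to_tensor=True)

def generate(prompt: str) -> str:
    """Stub for the fine-tuned small FM (e.g. LLaMA 2 or Falcon 7B)."""
    raise NotImplementedError

def retrieve(question: str, k: int = 2) -> list[str]:
    q = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q, kb_embeddings, top_k=k)[0]
    return [knowledge_base[h["corpus_id"]] for h in hits]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with references:"
    return generate(prompt)  # grounded answer from the fine-tuned FM
```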

  • Article type: Journal Article
    The subcellular localization of messenger RNAs (mRNAs) is a pivotal aspect of biomolecules, tightly linked to gene regulation and protein synthesis, and offers innovative insights into disease diagnosis and drug development in the field of biomedicine. Several computational methods have been proposed to predict the subcellular localization of mRNAs within cells; however, the accuracy of these predictions remains deficient. In this study, we propose mRCat, a predictor based on the gradient boosting tree algorithm that predicts whether mRNAs are localized in the nucleus or in the cytoplasm. The predictor first uses large language models to thoroughly explore hidden information within sequences and then integrates traditional sequence features to jointly characterize mRNA gene sequences. Finally, it employs CatBoost as the base classifier to predict the subcellular localization of mRNAs. Experimental validation on an independent test set demonstrates that mRCat obtained an accuracy of 0.761, an F1 score of 0.710, an MCC of 0.511, and an AUROC of 0.751. The results indicate that our method has higher accuracy and robustness than other state-of-the-art methods and is anticipated to offer deep insights for biomolecular research.
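
    A minimal sketch of an mRCat-style pipeline: concatenate language-model sequence embeddings with traditional sequence features and train a CatBoost classifier, reporting accuracy and MCC. The shapes and synthetic data are placeholders, not the paper's features.

```python
# Gradient-boosted classification over concatenated LM + sequence features.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef

rng = np.random.default_rng(0)
llm_embeddings = rng.normal(size=(500, 128))  # e.g. mean-pooled LM features
seq_features   = rng.normal(size=(500, 16))   # e.g. k-mer composition stats
X = np.hstack([llm_embeddings, seq_features])
y = rng.integers(0, 2, size=500)              # 0 = cytoplasm, 1 = nucleus

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = CatBoostClassifier(iterations=300, depth=6, verbose=False)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("ACC:", accuracy_score(y_te, pred), "MCC:", matthews_corrcoef(y_te, pred))
```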
