Large language models

  • Article Type: Journal Article
    BACKGROUND: Large Language Models (LLMs) play a crucial role in clinical information processing, showcasing robust generalization across diverse language tasks. However, existing LLMs, despite their significance, lack optimization for clinical applications, presenting challenges in terms of hallucinations and interpretability. The Retrieval-Augmented Generation (RAG) model addresses these issues by providing sources for answer generation, thereby reducing errors. This study explores the application of RAG technology in clinical gastroenterology to enhance knowledge generation on gastrointestinal diseases.
    METHODS: We fine-tuned the embedding model using a corpus consisting of 25 guidelines on gastrointestinal diseases. The fine-tuned model exhibited an 18% improvement in hit rate compared with its base model, gte-base-zh. Moreover, it outperformed OpenAI's embedding model by 20%. Employing the RAG framework with llama-index, we developed a Chinese gastroenterology chatbot named "GastroBot," which significantly improves answer accuracy and contextual relevance, minimizing errors and the risk of disseminating misleading information.
    RESULTS: When evaluating GastroBot using the RAGAS framework, we observed a context recall rate of 95%. Faithfulness to the source stands at 93.73%, and answer relevance reaches 92.28%. These findings highlight the effectiveness of GastroBot in providing accurate and contextually relevant information about gastrointestinal diseases. In manual assessment, compared with other models, GastroBot delivered a substantial amount of valuable knowledge while ensuring the completeness and consistency of its results.
    CONCLUSIONS: The findings suggest that incorporating the RAG method into clinical gastroenterology can enhance the accuracy and reliability of large language models. As a practical implementation of this method, GastroBot demonstrates significant improvements in contextual comprehension and response quality. Continued exploration and refinement of the model are poised to advance clinical information processing and decision support in gastroenterology.
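    The 18% hit-rate gain reported above is a retrieval metric. A minimal sketch of how top-k hit rate is typically computed for an embedding model — assuming precomputed vectors, cosine similarity, and one known relevant document per query; all data shapes here are hypothetical, not the paper's actual setup:

```python
# Sketch: top-k retrieval hit rate for an embedding model, assuming each
# query has exactly one known relevant document (hypothetical data shapes).

def top_k_hit_rate(query_vecs, doc_vecs, relevant_ids, k=5):
    """Fraction of queries whose relevant document appears among the
    top-k documents ranked by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    hits = 0
    for q_vec, rel_id in zip(query_vecs, relevant_ids):
        scores = [(cosine(q_vec, d), i) for i, d in enumerate(doc_vecs)]
        top = [i for _, i in sorted(scores, reverse=True)[:k]]
        hits += rel_id in top
    return hits / len(query_vecs)
```

    Comparing this number before and after fine-tuning, on the same query set, is one plausible way to obtain a relative improvement like the 18% reported.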
  • Article Type: Journal Article
    Large language models (LLMs) have attracted widespread attention recently; however, their application in specialized scientific fields still requires deep adaptation. Here, an artificial intelligence (AI) agent for organic field-effect transistors (OFETs) is designed by integrating the generative pre-trained transformer 4 (GPT-4) model with well-trained machine learning (ML) algorithms. It can efficiently extract the experimental parameters of OFETs from scientific literature and reshape them into a structured database, achieving precision and recall rates both exceeding 92%. Combined with well-trained ML models, this AI agent can further provide targeted guidance and suggestions for device design. With prompt engineering and human-in-the-loop strategies, the agent extracts sufficient information on 709 OFETs from 277 research articles across different publishers and gathers it into a standardized database containing more than 10 000 device parameters. Using this database, an ML model based on Extreme Gradient Boosting is trained for device performance judgment. Combined with the interpretation of the high-precision model, the agent has provided a feasible optimization scheme that has tripled the charge transport properties of 2,6-diphenyldithieno[3,2-b:2',3'-d]thiophene OFETs. This work is an effective practice of LLMs in the field of organic optoelectronic devices and expands the research paradigm of organic optoelectronic materials and devices.
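    The >92% precision and recall figures above are set-overlap metrics on extracted records. A minimal sketch under the assumption that extraction output and gold annotations are comparable tuples (the tuple schema below is hypothetical, not the paper's actual one):

```python
# Sketch: precision/recall of literature extraction, comparing extracted
# (device, parameter, value) tuples against a hand-labeled gold set.
# The tuple schema is hypothetical.

def extraction_scores(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                      # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```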
  • Article Type: Journal Article
    OBJECTIVE: Biomedical Knowledge Graphs play a pivotal role in various biomedical research domains. Concurrently, term clustering emerges as a crucial step in constructing these knowledge graphs, aiming to identify synonymous terms. Due to a lack of knowledge, previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle at clustering difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge from large language models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering.
    METHODS: The model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning, considering term and explanation embeddings simultaneously, and progressively introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH algorithm is designed for efficient clustering of a new ontology.
    RESULTS: We established a clustering test set and a hard negative test set, where our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35 580 932 terms from the Biomedical Informatics Ontology System (BIOS) into 22 104 559 clusters with O(N) queries to ChatGPT. Case studies highlight the model's efficacy in handling challenging samples, aided by information from explanations.
    CONCLUSIONS: By aligning terms to their explanations, CoRTEx demonstrates superior accuracy over benchmark models and robustness beyond its training set, and it is suitable for clustering terms for large-scale biomedical ontologies.
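    The contrastive objective with hard negatives described in METHODS can be illustrated with an InfoNCE-style loss over plain embedding vectors. This is a toy scorer on lists of floats; CoRTEx itself trains a transformer encoder, which is not reproduced here:

```python
import math

# Sketch: InfoNCE-style contrastive loss for a term embedding against the
# embedding of its explanation (positive) and a set of hard negatives.
# Toy vectors only; the actual model learns the embeddings end-to-end.

def info_nce_loss(term_vec, pos_vec, neg_vecs, temperature=0.1):
    """-log( exp(sim(t, pos)/T) / sum over {pos} + negs of exp(sim/T) )."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    logits = [cosine(term_vec, pos_vec) / temperature]
    logits += [cosine(term_vec, n) / temperature for n in neg_vecs]
    log_denom = math.log(sum(math.exp(z) for z in logits))
    return log_denom - logits[0]
```

    Intuitively, "progressively introducing hard negatives" means populating `neg_vecs` with terms whose cosine similarity to the anchor is high despite not being synonyms, which keeps the loss informative as training proceeds.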
  • Article Type: Journal Article
    OBJECTIVE: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.
    METHODS: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating LLMs with medical knowledge from 7 distinct perspectives.
    RESULTS: Directly applying ChatGPT failed to qualify for the CNMLE-2022 at a score of 51. Combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04, and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled a smaller Baichuan2-13B to pass the examination, showcasing great potential in low-resource settings.
    CONCLUSIONS: This study sheds light on optimal practices to enhance the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities of LLM applications and ensuring global benefit in this field.
  • Article Type: Journal Article
    How are Asian and Black men and women stereotyped? Research from the gendered race and stereotype content perspectives has produced mixed empirical findings. Using BERT models pre-trained on English language books, news articles, Wikipedia, Reddit and Twitter, with a new method for measuring propositions in natural language (the Fill-Mask Association Test, FMAT), we explored the gender (masculinity-femininity), physical strength, warmth and competence contents of stereotypes about Asian and Black men and women. We find that Asian men (but not women) are stereotyped as less masculine and less moral/trustworthy than Black men. Compared to Black men and Black women, respectively, both Asian men and Asian women are stereotyped as less muscular/athletic and less assertive/dominant, but more sociable/friendly and more capable/intelligent. These findings suggest that Asian and Black stereotypes in natural language have multifaceted contents and gender nuances, requiring a balanced view integrating the gender schema theory and the stereotype content model. Exploring their semantic representations as propositions in large language models, this research reveals how intersectional race-gender stereotypes are naturally expressed in real life.
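    The Fill-Mask Association Test contrasts a masked language model's probabilities for attribute words across target groups. A minimal sketch of the contrast itself, assuming the fill-mask probabilities have already been obtained from a model (the probability values in the test are made up for illustration):

```python
import math

# Sketch: an FMAT-style association contrast. Given a masked-LM probability
# for an attribute word in two target contexts (e.g. the same sentence with
# different group terms), the log probability ratio indicates which target
# the attribute is more associated with. Probabilities here are assumed
# inputs, not real model outputs.

def association_score(p_attr_given_target_a, p_attr_given_target_b):
    """Positive -> attribute more strongly associated with target A."""
    return math.log(p_attr_given_target_a / p_attr_given_target_b)
```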
  • Article Type: Journal Article
    BACKGROUND: Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of efforts to ensure the availability of high-quality patient education materials for better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising future tool in delivering accurate and tailored information to patients.
    OBJECTIVE: We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis.
    METHODS: In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and inputted into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, who specialize in amyloidosis. These 2 reviewers (RG and DCK) also graded general questions for which disagreements were resolved with discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by categories for further analysis. Reproducibility was assessed by inputting each question twice into each model. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R software (R Foundation for Statistical Computing).
    RESULTS: ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to "general questions" (n=56). When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best in response to cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses, and ChatGPT-3.5 for 8 (53%). Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4's responses ranged from 10th grade to beyond graduate US grade levels, with an average of 15.5 (SD 1.9).
    CONCLUSIONS: Large language models are a promising tool for accurate and reliable health information for patients living with amyloidosis. However, ChatGPT's responses exceeded the American Medical Association's recommended fifth- to sixth-grade reading level. Future studies focusing on improving response accuracy and readability are warranted. Prior to widespread implementation, the technology's limitations and ethical implications must be further explored to ensure patient safety and equitable implementation.
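    The grade levels above come from readability formulas of the kind Textstat implements. A rough sketch of the Flesch-Kincaid grade level — the syllable counter is a crude vowel-group heuristic, so this only approximates what the Textstat packages report:

```python
import re

# Sketch: approximate Flesch-Kincaid grade level,
#   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
# The syllable count uses a rough vowel-group heuristic, so results will
# differ slightly from Textstat's.

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word):
        # Count runs of vowels as one syllable each, minimum one.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    total_syll = sum(syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (total_syll / len(words))
            - 15.59)
```

    On this scale, the reported average of 15.5 corresponds to text readable at roughly a US college-senior level, well above the fifth- to sixth-grade target noted in the conclusions.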
  • Article Type: Editorial
    The problematic use of social media has numerous negative impacts on individuals' daily lives, interpersonal relationships, physical and mental health, and more. Currently, there are few methods and tools to alleviate problematic social media use, and their potential is yet to be fully realized. Emerging large language models (LLMs) are becoming increasingly popular for providing information and assistance to people and are being applied in many aspects of life. In mitigating problematic social media use, LLMs such as ChatGPT can play a positive role by serving as conversational partners and outlets for users, providing personalized information and resources, monitoring and intervening in problematic social media use, and more. In this process, we should recognize both the enormous potential and endless possibilities of LLMs such as ChatGPT, leveraging their advantages to better address problematic social media use, while also acknowledging the limitations and potential pitfalls of ChatGPT technology, such as errors, limitations in issue resolution, privacy and security concerns, and potential overreliance. When we leverage the advantages of LLMs to address issues in social media usage, we must adopt a cautious and ethical approach, remaining vigilant about the potential adverse effects LLMs may have in addressing problematic social media use, to better harness technology to serve individuals and society.
  • Article Type: Journal Article
    OBJECTIVE: Recently, large language models (LLMs) have showcased remarkable capabilities in natural language understanding. While demonstrating proficiency in everyday conversations and question-answering (QA) situations, these models frequently struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. In this article, we describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed PMC-LLaMA.
    METHODS: We adapt a general-purpose LLM toward the medical domain, involving data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive domain-specific instruction fine-tuning, encompassing medical QA, rationale for reasoning, and conversational dialogues with 202M tokens.
    RESULTS: In evaluations on various public medical QA benchmarks and in manual rating, our lightweight PMC-LLaMA, which consists of only 13B parameters, exhibits superior performance, even surpassing ChatGPT. All models, code, and datasets for instruction tuning will be released to the research community.
    DISCUSSION: Our contributions are 3-fold: (1) we build up an open-source LLM toward the medical domain. We believe the proposed PMC-LLaMA model can promote further development of foundation models in medicine, serving as a medical trainable basic generative language backbone; (2) we conduct thorough ablation studies to demonstrate the effectiveness of each proposed component, demonstrating how different training data and model scales affect medical LLMs; (3) we contribute a large-scale, comprehensive dataset for instruction tuning.
    CONCLUSIONS: In this article, we systematically investigate the process of building up an open-source medical-specific LLM, PMC-LLaMA.
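    The METHODS mention instruction fine-tuning over medical QA, reasoning rationales, and dialogues. A minimal sketch of shaping a QA item into an instruction-tuning record; the field names and template are hypothetical, not PMC-LLaMA's actual data format:

```python
# Sketch: shaping a medical QA item into an instruction-tuning record.
# Field names and the instruction template are hypothetical, not the
# actual PMC-LLaMA format.

def to_instruction_record(question, rationale, answer):
    return {
        "instruction": "Answer the medical question and explain your reasoning.",
        "input": question,
        "output": f"{rationale}\nAnswer: {answer}",
    }
```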
  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have gained prominence since the release of ChatGPT in late 2022.
    OBJECTIVE: The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: the natural sciences and humanities.
    METHODS: Two researchers independently prompted ChatGPT to write an introduction section for a manuscript and include citations; they then evaluated the accuracy of the citations and Digital Object Identifiers (DOIs). Results were compared between the two disciplines.
    RESULTS: Ten topics were included, including 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, with 55 in the natural sciences and 47 in the humanities. Among these, 40 citations (72.7%) in the natural sciences and 36 citations (76.6%) in the humanities were confirmed to exist (P=.42). There were significant disparities found in DOI presence in the natural sciences (39/55, 70.9%) and the humanities (18/47, 38.3%), along with significant differences in accuracy between the two disciplines (18/55, 32.7% vs 4/47, 8.5%). DOI hallucination was more prevalent in the humanities (42/55, 89.4%). The Levenshtein distance was significantly higher in the humanities than in the natural sciences, reflecting the lower DOI accuracy.
    CONCLUSIONS: ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider the strengths and limitations of artificial intelligence writing tools with respect to citation accuracy. The use of domain-specific models may enhance accuracy.
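    The Levenshtein distance cited in the RESULTS is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn a generated DOI into the true DOI; a larger distance means a less accurate DOI string. A minimal implementation:

```python
# Sketch: Levenshtein edit distance, as used to compare generated DOIs
# against true DOIs (dynamic programming, one row at a time).

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```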
  • Article Type: Journal Article
    BACKGROUND: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention.
    OBJECTIVE: This study aimed to introduce a novel modular LLM pipeline, which could semantically extract features from textual patient admission records.
    METHODS: The pipeline was designed to process a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, which was tested via 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People's Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation with the help of a local expert's annotation. The pipeline was evaluated with the metrics of accuracy and precision, null ratio, and time consumption. Additionally, we evaluated its performance via a quantized version of Qwen-14B-Chat on a consumer-grade GPU.
    RESULTS: The pipeline demonstrates a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantized version of QWEN delivered enhanced performance with 97.28% accuracy and a 0% null ratio.
    CONCLUSIONS: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records.
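    A modular pipeline of this kind chains stages around a pluggable LLM callable. A minimal sketch in which the stage names mirror the paper's description but every implementation detail (prompts, the stub model, the record format) is a placeholder:

```python
# Sketch: chaining pipeline stages (concept extraction -> question
# generation -> QA extraction) around a pluggable LLM callable. The stub
# below stands in for a real model such as Qwen or Baichuan; prompts and
# record format are hypothetical.

def run_pipeline(record, llm):
    concepts = llm(f"List clinical concepts in: {record}")
    questions = [f"What is the value of {c}?" for c in concepts]
    return {c: llm(f"{q}\nContext: {record}") for c, q in zip(concepts, questions)}

def stub_llm(prompt):
    # Placeholder: a real deployment would call a local LLM here.
    if prompt.startswith("List clinical concepts"):
        return ["blood pressure", "gestational age"]
    return "unknown"
```

    Because each stage only depends on the `llm` callable, swapping models (or a quantized variant on consumer hardware) requires no change to the pipeline itself, which is consistent with the consistent cross-model performance the study reports.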