Bidirectional Encoder Representations from Transformers

  • Article type: Journal Article
    BACKGROUND: ICU readmissions and post-discharge mortality pose significant challenges. Previous studies used EHRs and machine learning models, but mostly focused on structured data. Nursing records contain crucial unstructured information, but their utilization is challenging. Natural language processing (NLP) can extract structured features from clinical text. This study proposes the Crucial Nursing Description Extractor (CNDE) to predict post-ICU discharge mortality rates and identify high-risk patients for unplanned readmission by analyzing electronic nursing records.
    OBJECTIVE: To develop a deep neural network (NurnaNet) capable of perceiving nursing records, combined with a pretrained biomedical language model (BioClinicalBERT), to analyze electronic health records (EHRs) in the MIMIC-III dataset and predict patients' risk of death within six months and two years.
    METHODS: A cohort and system development design was used. The analysis was based on data extracted from MIMIC-III, a database of critically ill patients in the United States between 2001 and 2012. We calculated patients' age using admission time and date-of-birth information from the MIMIC dataset. Patients under 18 or over 89 years of age, or who died in the hospital, were excluded. We analyzed 16,973 nursing records from patients' ICU stays. We developed a technique called the Crucial Nursing Description Extractor (CNDE) to extract key content from text, using the log-likelihood ratio to extract keywords, combined with BioClinicalBERT. We predicted the survival of discharged patients after six months and two years and evaluated model performance using precision, recall, the F1-score, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the precision-recall (PR) curve.
    RESULTS: The research findings indicate that NurnaNet achieved good F1-scores within six months and two years (0.67030 and 0.70874, respectively). Compared with using BioClinicalBERT alone, prediction performance improved by 2.05% and 1.08% for six months and two years, respectively.
    CONCLUSIONS: CNDE can effectively reduce long-form records and extract key content. NurnaNet has a good F1-score in analyzing the data of nursing records, which helps to identify the risk of death of patients after leaving the hospital and adjust the regular follow-up and treatment plan of relevant medical care as soon as possible.
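As a rough illustration of the log-likelihood-ratio keyword extraction described in the METHODS, the sketch below scores tokens that are over-represented in a target set of notes relative to a reference set using Dunning's G² statistic. The toy corpora and token-level granularity are assumptions for illustration, not the authors' actual CNDE implementation:

```python
import math
from collections import Counter

def log_likelihood_ratio(a, b, c, d):
    """Dunning's log-likelihood (G2) for a term appearing
    a times in a target corpus of size c and
    b times in a reference corpus of size d."""
    e1 = c * (a + b) / (c + d)   # expected count in target corpus
    e2 = d * (a + b) / (c + d)   # expected count in reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def rank_keywords(target_tokens, reference_tokens, top_k=5):
    """Rank terms most over-represented in the target corpus."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scores = {t: log_likelihood_ratio(tf[t], rf.get(t, 0), c, d) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# hypothetical token streams from high-risk vs routine nursing notes
target = ("patient unresponsive sepsis vasopressor sepsis "
          "intubated sepsis vasopressor").split()
reference = ("patient stable ambulating patient discharged home "
             "patient comfortable stable").split()
print(rank_keywords(target, reference, top_k=2))
```

Terms frequent in the target notes but rare in the reference notes get the highest G² scores, which is what makes the statistic useful for pulling clinically salient keywords out of long-form records.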

  • Article type: Journal Article
    BACKGROUND: In 2023, the United States experienced its highest recorded number of suicides, exceeding 50,000 deaths. In the realm of psychiatric disorders, major depressive disorder stands out as the most common issue, affecting 15% to 17% of the population and carrying a notable suicide risk of approximately 15%. However, not everyone with depression has suicidal thoughts. While "suicidal depression" is not a clinical diagnosis, it may be observed in daily life, emphasizing the need for awareness.
    OBJECTIVE: This study aims to examine the dynamics, emotional tones, and topics discussed in posts within the r/Depression subreddit, with a specific focus on users who had also engaged in the r/SuicideWatch community. The objective was to use natural language processing techniques and models to better understand the complexities of depression among users with potential suicide ideation, with the goal of improving intervention and prevention strategies for suicide.
    METHODS: Archived posts in English were extracted from the r/Depression and r/SuicideWatch Reddit communities, spanning 2019 to 2022, resulting in a final data set of over 150,000 posts contributed by approximately 25,000 unique overlapping users. A broad and comprehensive mix of methods was applied to these posts, including trend and survival analysis, to explore the dynamics of users across the 2 subreddits. The BERT family of models extracted features from the data for sentiment and thematic analysis.
    RESULTS: On August 16, 2020, the post count in r/SuicideWatch surpassed that of r/Depression. The transition from r/Depression to r/SuicideWatch in 2020 was the shortest, lasting only 26 days. Sadness emerged as the most prevalent emotion among overlapping users in the r/Depression community. In addition, physical activity changes, negative self-view, and suicidal thoughts were identified as the most common depression symptoms, all showing strong positive correlations with the emotional tone of disappointment. Furthermore, the topic "struggles with depression and motivation in school and work" (12%) emerged as the most discussed topic aside from suicidal thoughts when users were categorized based on their inclination toward suicide ideation.
    CONCLUSIONS: Our study underscores the effectiveness of using natural language processing techniques to explore language markers and patterns associated with mental health challenges in online communities like r/Depression and r/SuicideWatch. These insights offer novel perspectives distinct from previous research. In the future, there will be potential for further refinement and optimization of machine classifications using these techniques, which could lead to more effective intervention and prevention strategies.
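The transition-time measurement underlying the 26-day finding can be sketched as follows: for each overlapping user, take the interval from their first r/Depression post to their first subsequent r/SuicideWatch post. The post records below are fabricated toy data; the study's actual survival analysis is more elaborate:

```python
from datetime import datetime

# toy posts: (user, subreddit, timestamp) — hypothetical data
posts = [
    ("u1", "r/Depression",   datetime(2020, 3, 1)),
    ("u1", "r/SuicideWatch", datetime(2020, 3, 27)),
    ("u2", "r/Depression",   datetime(2020, 5, 10)),
    ("u2", "r/SuicideWatch", datetime(2020, 7, 1)),
    ("u3", "r/Depression",   datetime(2020, 6, 1)),  # never transitions
]

def transition_days(posts):
    """Days from each user's first r/Depression post to their
    first subsequent r/SuicideWatch post."""
    first_dep, durations = {}, {}
    for user, sub, ts in sorted(posts, key=lambda p: p[2]):
        if sub == "r/Depression":
            first_dep.setdefault(user, ts)
        elif sub == "r/SuicideWatch" and user in first_dep and user not in durations:
            durations[user] = (ts - first_dep[user]).days
    return durations

d = transition_days(posts)
print(min(d.values()))  # shortest observed transition, in days
```

Users who never post in r/SuicideWatch (like u3 above) are simply absent from the result, which is how censored observations enter a survival analysis.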

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM.
    OBJECTIVE: Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field.
    METHODS: Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data.
    RESULTS: A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills.
    CONCLUSIONS: LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians\' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
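One routine step behind the "1994 unique titles" figure in such a review is deduplicating records retrieved from multiple databases before screening. A small, hypothetical sketch of title normalization for that purpose (the normalization rules are assumptions, not the authors' protocol):

```python
import re

def normalize_title(title):
    """Case-fold, strip punctuation, and collapse whitespace so that
    near-identical records from different databases compare equal."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return re.sub(r"\s+", " ", title).strip()

def deduplicate(records):
    """Keep the first occurrence of each normalized title."""
    seen, unique = set(), []
    for title in records:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

# hypothetical search results from two databases
records = [
    "Large Language Models in Emergency Medicine: A Review",
    "Large language models in emergency medicine - a review",
    "Triage with GPT models in the ED",
]
print(len(deduplicate(records)))
```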

  • Article type: Journal Article
    Large language models (LLMs) are transformer-based neural networks that can provide human-like responses to questions and instructions. LLMs can generate educational material, summarize text, extract structured data from free text, create reports, write programs, and potentially assist in case sign-out. LLMs combined with vision models can assist in interpreting histopathology images. LLMs have immense potential to transform pathology practice and education, but these models are not infallible, so any artificial intelligence-generated content must be verified against reputable sources. Caution must be exercised in how these models are integrated into clinical practice, as they can produce hallucinations and incorrect results, and over-reliance on artificial intelligence may lead to deskilling and automation bias. This review paper provides a brief history of LLMs and highlights several use cases for LLMs in the field of pathology.

  • Article type: Journal Article
    BACKGROUND: In this paper, we present an automated method for article classification, leveraging the power of large language models (LLMs).
    OBJECTIVE: The aim of this study is to evaluate the applicability of various LLMs based on textual content of scientific ophthalmology papers.
    METHODS: We developed a model based on natural language processing techniques, including advanced LLMs, to process and analyze the textual content of scientific papers. Specifically, we used zero-shot learning LLMs and compared Bidirectional and Auto-Regressive Transformers (BART) and its variants with Bidirectional Encoder Representations from Transformers (BERT) and its variants, such as distilBERT, SciBERT, PubmedBERT, and BioBERT. To evaluate the LLMs, we compiled a data set (retinal diseases [RenD]) of 1000 ocular disease-related articles, which were expertly annotated by a panel of 6 specialists into 19 distinct categories. In addition to the classification of articles, we also performed an analysis of the classified groups to identify patterns and trends in the field.
    RESULTS: The classification results demonstrate the effectiveness of LLMs in categorizing a large number of ophthalmology papers without human intervention. The model achieved a mean accuracy of 0.86 and a mean F1-score of 0.85 based on the RenD data set.
    CONCLUSIONS: The proposed framework achieves notable improvements in both accuracy and efficiency. Its application in the domain of ophthalmology showcases its potential for knowledge organization and retrieval. We performed a trend analysis that enables researchers and clinicians to easily categorize and retrieve relevant papers, saving time and effort in literature review and information gathering as well as identification of emerging scientific trends within different disciplines. Moreover, the extendibility of the model to other scientific fields broadens its impact in facilitating research and trend analysis across diverse disciplines.
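If the reported mean accuracy and mean F1-score are macro-style averages over the annotated categories, they can be computed as in this minimal sketch (the labels below are made up, not the RenD data):

```python
from collections import defaultdict

def per_class_f1(y_true, y_pred):
    """Precision, recall, and F1 per class, plus the macro-averaged F1."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but was not p
            fn[t] += 1   # was t, but not predicted t
    scores = {}
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
    return scores, macro_f1

# toy gold labels and model predictions over 3 hypothetical categories
y_true = ["glaucoma", "retina", "retina", "cornea", "retina"]
y_pred = ["glaucoma", "retina", "cornea", "cornea", "retina"]
scores, macro_f1 = per_class_f1(y_true, y_pred)
print(round(macro_f1, 3))
```

Macro averaging weights each of the 19 categories equally, which matters when category sizes are imbalanced.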

  • Article type: Multicenter Study
    BACKGROUND: Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must in many cases be removed to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies have investigated the combination of transformer-based language models and rules.
    OBJECTIVE: The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embeddings and transformer-based language models.
    METHODS: In this study, we present a hybrid deidentification pipeline called OpenDeID, developed using an Australian multicenter EHR-based corpus called the OpenDeID Corpus. The corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models.
    RESULTS: OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time.
    CONCLUSIONS: The OpenDeID pipeline is a hybrid deidentification pipeline for deidentifying SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.
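The rule layer of a hybrid pipeline like this can be sketched as simple regex scrubbing. The patterns below are illustrative assumptions, not OpenDeID's actual rules, and in the real system such rules are combined with a fine-tuned transformer model:

```python
import re

# Hypothetical rule layer: each pattern maps one SHI type to a placeholder.
SHI_RULES = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b\d{4}-\d{4}-\d{4}\b"), "[MRN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[DOCTOR]"),
]

def deidentify(note):
    """Replace every rule match in the note with its SHI placeholder."""
    for pattern, placeholder in SHI_RULES:
        note = pattern.sub(placeholder, note)
    return note

note = ("Seen by Dr. Smith on 12/03/2019; MRN 1234-5678-9012, "
        "contact a.smith@hospital.org.")
print(deidentify(note))
```

A rule layer like this catches regular, high-precision SHI formats, leaving irregular mentions (nicknames, misspelled names, free-text addresses) to the supervised model.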

  • Article type: Journal Article
    BACKGROUND: Lyme disease is among the most reported tick-borne diseases worldwide, making it a major ongoing public health concern. An effective Lyme disease case reporting system depends on timely diagnosis and reporting by health care professionals, and on accurate laboratory testing and interpretation for clinical diagnosis validation. A lack of these can lead to delayed diagnosis and treatment, which can exacerbate the severity of Lyme disease symptoms. Therefore, there is a need to improve the monitoring of Lyme disease by using other data sources, such as web-based data.
    OBJECTIVE: We analyzed global Twitter data to understand its potential and limitations as a tool for Lyme disease surveillance. We propose a transformer-based classification system to identify potential Lyme disease cases using self-reported tweets.
    METHODS: Our initial sample included 20,000 tweets collected worldwide from a database of over 1.3 million Lyme disease tweets. After preprocessing and geolocating tweets, tweets in a subset of the initial sample were manually labeled as potential Lyme disease cases or non-Lyme disease cases using carefully selected keywords. Emojis were converted to sentiment words, which then replaced them in the tweets. This labeled tweet set was used for the training, validation, and performance testing of DistilBERT (a distilled version of BERT [Bidirectional Encoder Representations from Transformers]), ALBERT (A Lite BERT), and BERTweet (BERT for English Tweets) classifiers.
    RESULTS: The empirical results showed that BERTweet was the best classifier among all evaluated models (average F1-score of 89.3%, classification accuracy of 90.0%, and precision of 97.1%). However, for recall, term frequency-inverse document frequency and k-nearest neighbors performed better (93.2% and 82.6%, respectively). On using emojis to enrich the tweet embeddings, BERTweet had an increased recall (8% increase), DistilBERT had an increased F1-score of 93.8% (4% increase) and classification accuracy of 94.1% (4% increase), and ALBERT had an increased F1-score of 93.1% (5% increase) and classification accuracy of 93.9% (5% increase). General awareness of Lyme disease was high in the United States, the United Kingdom, Australia, and Canada, with self-reported potential cases from these countries accounting for around 50% (9939/20,000) of the collected English-language tweets, whereas Lyme disease-related tweets were rare in countries in Africa and Asia. The most reported Lyme disease-related symptoms in the data were rash, fatigue, fever, and arthritis, while symptoms such as lymphadenopathy, palpitations, swollen lymph nodes, neck stiffness, and arrhythmia were uncommon, in accordance with Lyme disease symptom frequency.
    CONCLUSIONS: The study highlights the robustness of BERTweet and DistilBERT as classifiers for potential cases of Lyme disease from self-reported data. The results demonstrated that emojis are effective for enrichment, thereby improving the accuracy of tweet embeddings and the performance of classifiers. Specifically, emojis reflecting sadness, empathy, and encouragement can reduce false negatives.
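The emoji-to-sentiment-word enrichment step can be sketched as below. The emoji lexicon here is hypothetical, since the abstract does not specify the mapping the authors used:

```python
# Hypothetical emoji-to-sentiment-word map (assumption, not the study's lexicon).
EMOJI_SENTIMENT = {
    "😢": "sadness",
    "🤒": "sick",
    "💪": "encouragement",
    "❤️": "empathy",
}

def enrich_tweet(text):
    """Replace emojis with sentiment words before feeding the tweet
    to a text-only encoder such as BERTweet."""
    for emoji, word in EMOJI_SENTIMENT.items():
        text = text.replace(emoji, f" {word} ")
    return " ".join(text.split())  # normalize whitespace

tweet = "Bullseye rash and so tired 😢 third week of doxy 💪"
print(enrich_tweet(tweet))
```

Converting emojis to words keeps their affective signal in-vocabulary for the tokenizer instead of discarding them during preprocessing, which is why the enrichment improved recall in the study.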

  • Article type: Journal Article
    Drug-target interactions (DTIs) are considered a crucial component of drug design and drug discovery. To date, many computational methods have been developed for predicting drug-target interactions, but they remain insufficiently informative for accurately predicting DTIs due to the lack of experimentally verified negative datasets, inaccurate molecular feature representation, and ineffective DTI classifiers. We address the limitation of randomly selecting negative DTI data from unknown drug-target pairs by establishing two experimentally validated datasets, and we propose a capsule network-based framework called CapBM-DTI to capture hierarchical relationships between drugs and targets. It adopts pretrained Bidirectional Encoder Representations from Transformers (BERT) for contextual sequence feature extraction from target proteins through transfer learning, and a message-passing neural network (MPNN) for 2-D graph feature extraction from compounds, to identify drug-target interactions accurately and robustly. We compared the performance of CapBM-DTI with state-of-the-art methods using four experimentally validated DTI datasets of different sizes, including human (Homo sapiens) and worm (Caenorhabditis elegans) species datasets, as well as three subsets (new compounds, new proteins, and new pairs). Our results demonstrate that the proposed model achieved robust performance and powerful generalization ability in all experiments. A case study on treating COVID-19 demonstrates the applicability of the model in virtual screening.
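A single message-passing step of the kind an MPNN performs over a compound's 2-D graph can be sketched in plain Python. This is purely schematic: unweighted sum aggregation with no learned message or update functions, unlike the trained MPNN in CapBM-DTI:

```python
# Schematic single message-passing step on a small molecular graph.
def message_passing_step(node_feats, edges):
    """Each node's new feature = old feature + sum of its neighbors'
    features (a stand-in for learned message/update functions)."""
    new_feats = []
    for i, feat in enumerate(node_feats):
        msg = [0.0] * len(feat)
        for a, b in edges:
            neighbor = b if a == i else a if b == i else None
            if neighbor is not None:
                msg = [m + x for m, x in zip(msg, node_feats[neighbor])]
        new_feats.append([f + m for f, m in zip(feat, msg)])
    return new_feats

# toy 3-atom chain 0 - 1 - 2 with 2-dimensional atom features
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
print(message_passing_step(feats, edges))
```

Stacking several such steps lets each atom's representation absorb information from progressively larger neighborhoods, which is the core idea behind graph feature extraction for compounds.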

  • Article type: Journal Article
    BACKGROUND: While scientific knowledge of post-COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes, as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.
    OBJECTIVE: In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from the social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes and tracked symptom and condition term occurrences over time and locations to explore the pipeline's potential as a surveillance tool.
    METHODS: We used Bidirectional Encoder Representations from Transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of the BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries.
    RESULTS: UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. Based on our findings, the top 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). In addition, we found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Regarding co-occurring symptoms, the pair of fatigue and headaches was among the most frequently co-occurring term pairs across both platforms. Based on the temporal analysis, neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis concluded that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, the United Kingdom, and Canada.
    CONCLUSIONS: The outcome of our social media-derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient's journey that can help health care providers anticipate future needs.
    RR2-10.1101/2022.12.14.22283419.
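The semantic-search normalization step can be sketched as a nearest-neighbor lookup under cosine similarity. The toy 3-dimensional vectors below stand in for BERT biencoder embeddings, and the concept names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy embeddings of standardized concepts; in the study these come
# from a BERT biencoder over a standardized terminology
concept_vecs = {
    "Fatigue": [0.9, 0.1, 0.0],
    "Dyspnea": [0.0, 0.8, 0.2],
    "Anxiety": [0.1, 0.1, 0.9],
}

def normalize_term(term_vec, concepts):
    """Map an extracted term embedding to the nearest standard concept."""
    return max(concepts, key=lambda c: cosine(term_vec, concepts[c]))

extracted = [0.85, 0.2, 0.05]  # e.g. embedding of "wiped out all the time"
print(normalize_term(extracted, concept_vecs))
```

Because the lookup compares embeddings rather than surface strings, colloquial patient phrasings can be mapped to clinical concepts they never mention literally.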

  • Article type: Journal Article
    Umami peptides have received extensive attention due to their ability to enhance flavors and provide nutritional benefits. The increasing demand for novel umami peptides and the vast number of peptides present in food call for more efficient methods to screen umami peptides, and further exploration is necessary. The purpose of this study was therefore to develop a deep learning (DL) model to enable rapid screening of umami peptides. The Umami-BERT model was devised using a novel two-stage training strategy with Bidirectional Encoder Representations from Transformers (BERT) and an inception network. In the pre-training stage, attention mechanisms were applied to a large number of bioactive peptide sequences to acquire high-dimensional generalized features. In the re-training stage, umami peptide prediction was carried out on the UMP789 dataset, which was developed through the latest research. The model achieved an accuracy (ACC) of 93.23% and an MCC of 0.78 on the balanced dataset, as well as an ACC of 95.00% and an MCC of 0.85 on the unbalanced dataset. The results demonstrated that Umami-BERT can predict umami peptides directly from their amino acid sequences and exceeded the performance of other models. Furthermore, Umami-BERT enabled analysis of the attention patterns it learned. The amino acids alanine (A), cysteine (C), aspartate (D), and glutamic acid (E) were found to be the most significant contributors to umami peptides, and the patterns of umami peptides involving A, C, D, and E were analyzed based on the learned attention weights. Consequently, Umami-BERT exhibits great potential for the large-scale screening of candidate peptides and offers novel insight for the further exploration of umami peptides.
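Aggregating learned attention weights by residue type, as done to identify A, C, D, and E as key contributors, can be sketched like this (the peptide sequences and per-residue weights below are made up for illustration):

```python
from collections import defaultdict

def residue_importance(sequences, attention):
    """Average attention weight received by each amino acid type
    across a set of peptide sequences."""
    totals, counts = defaultdict(float), defaultdict(int)
    for seq, weights in zip(sequences, attention):
        for aa, w in zip(seq, weights):
            totals[aa] += w
            counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in totals}

# toy peptides with made-up per-residue attention weights
seqs = ["ADE", "CE", "GAD"]
attn = [[0.5, 0.8, 0.9], [0.7, 0.85], [0.1, 0.4, 0.75]]
scores = residue_importance(seqs, attn)
top = max(scores, key=scores.get)
print(top, round(scores[top], 2))
```

Averaging over many sequences smooths out position-specific effects and surfaces which residue types the model consistently attends to.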