BERT

  • Article type: Journal Article
    Metastatic breast cancer (MBC) continues to be a leading cause of cancer-related deaths among women. This work introduces an innovative non-invasive breast cancer classification model designed to improve the identification of cancer metastases. While this study marks the initial exploration into predicting MBC, additional investigations are essential to validate the occurrence of MBC. Our approach combines the strengths of large language models (LLMs), specifically the bidirectional encoder representations from transformers (BERT) model, with the powerful capabilities of graph neural networks (GNNs) to predict MBC patients based on their histopathology reports. This paper introduces a BERT-GNN approach for metastatic breast cancer prediction (BG-MBC) that integrates graph information derived from the BERT model. In this model, nodes are constructed from patient medical records, while BERT embeddings are employed to vectorise the words in histopathology reports, capturing semantic information crucial for classification. Three distinct approaches (univariate selection, an extra-trees classifier for feature importance, and Shapley values) identify the features with the most significant impact; by selecting the 30 most crucial of the 676 features generated as embeddings during model training, the model further enhances its predictive capability. The BG-MBC model achieves outstanding accuracy, with a detection rate of 0.98 and an area under the curve (AUC) of 0.98, in identifying MBC patients. This remarkable performance is credited to the model's utilisation of attention scores generated by the LLM from histopathology reports, effectively capturing pertinent features for classification.
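
    As a hedged illustration of the three feature-selection strategies named above (univariate selection, an extra-trees classifier, and Shapley values), the sketch below ranks synthetic stand-in features; the array shapes, model settings, and the optional shap package usage are assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: three ways to rank 676 embedding features and keep the top 30.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 676))    # stand-in for BERT-derived embedding features
y = rng.integers(0, 2, size=200)   # stand-in for MBC / non-MBC labels

# 1) Univariate selection: ANOVA F-test score per feature.
selector = SelectKBest(f_classif, k=30).fit(X, y)
top_univariate = np.argsort(selector.scores_)[-30:]

# 2) Extra-trees importance: impurity-based ranking from an ensemble.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
top_forest = np.argsort(forest.feature_importances_)[-30:]

# 3) Shapley values via the optional `shap` package (pip install shap).
import shap
vals = shap.TreeExplainer(forest).shap_values(X)
vals = vals[1] if isinstance(vals, list) else vals[..., 1]  # positive class
top_shap = np.argsort(np.abs(vals).mean(axis=0))[-30:]
```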

  • Article type: Journal Article
    BACKGROUND: Research gaps refer to unanswered questions in the existing body of knowledge, either due to a lack of studies or inconclusive results. Research gaps are essential starting points and motivation in scientific research. Traditional methods for identifying research gaps, such as literature reviews and expert opinions, can be time consuming, labor intensive, and prone to bias. They may also fall short when dealing with rapidly evolving or time-sensitive subjects. Thus, innovative scalable approaches are needed to identify research gaps, systematically assess the literature, and prioritize areas for further study in the topic of interest.
    OBJECTIVE: In this paper, we propose a machine learning-based approach for identifying research gaps through the analysis of scientific literature. We used the COVID-19 pandemic as a case study.
    METHODS: We conducted an analysis to identify research gaps in COVID-19 literature using the COVID-19 Open Research (CORD-19) data set, which comprises 1,121,433 papers related to the COVID-19 pandemic. Our approach is based on the BERTopic topic modeling technique, which leverages transformers and class-based term frequency-inverse document frequency to create dense clusters allowing for easily interpretable topics. Our BERTopic-based approach involves 3 stages: embedding documents, clustering documents (dimension reduction and clustering), and representing topics (generating candidates and maximizing candidate relevance).
    RESULTS: After applying the study selection criteria, we included 33,206 abstracts in the analysis of this study. The final list of research gaps identified 21 different areas, which were grouped into 6 principal topics: "virus of COVID-19," "risk factors of COVID-19," "prevention of COVID-19," "treatment of COVID-19," "health care delivery during COVID-19," and "impact of COVID-19." The most prominent topic, observed in over half of the analyzed studies, was "impact of COVID-19."
    CONCLUSIONS: The proposed machine learning-based approach has the potential to identify research gaps in scientific literature. This study is not intended to replace individual literature research within a selected topic. Instead, it can serve as a guide to formulate precise literature search queries in specific areas associated with research questions that previous publications have earmarked for future exploration. Future research should leverage an up-to-date list of studies that are retrieved from the most common databases in the target area. When feasible, full texts or, at minimum, discussion sections should be analyzed rather than limiting their analysis to abstracts. Furthermore, future studies could evaluate more efficient modeling algorithms, especially those combining topic modeling with statistical uncertainty quantification, such as conformal prediction.
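
    The three-stage BERTopic pipeline described in METHODS can be reproduced with the bertopic package; the sketch below uses a public newsgroups corpus as stand-in data (the CORD-19 abstracts are not bundled), so the settings are illustrative assumptions rather than the study's configuration.

```python
# Minimal sketch of the BERTopic pipeline: embed -> reduce/cluster -> c-TF-IDF.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpus; the study analyzed 33,206 CORD-19 abstracts instead.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Internally: sentence-transformer embeddings, UMAP dimension reduction,
# HDBSCAN clustering, then class-based TF-IDF for topic representations.
topic_model = BERTopic(language="english")
topics, _ = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head(10))  # topic sizes and top terms
```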

  • Article type: Journal Article
    BACKGROUND: Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they typically require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.
    OBJECTIVE: To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) "What kind of rejection does the patient show?"; (2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?"
    METHODS: Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports comprising 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT's tokenizer with six technical keywords and repeating the pre-training procedure. This extended the model's vocabulary. All three models were fine-tuned with information retrieval heads.
    RESULTS: The model with extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.
    CONCLUSIONS: ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary library is essential for specialized domains to improve performance, especially when pre-training on small corpora.
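
    A minimal sketch of the tokenizer-extension step in METHODS, assuming the public Bio_ClinicalBERT checkpoint and six hypothetical renal keywords (the abstract does not list the actual six):

```python
# Minimal sketch: extend a BERT tokenizer's vocabulary, then resize embeddings
# so continued pre-training can learn the new tokens.
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # assumed Clinical BERT variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical domain keywords; the paper's six are not given in the abstract.
new_terms = ["ABMR", "TCMR", "IFTA", "glomerulitis", "tubulitis", "arteritis"]
num_added = tokenizer.add_tokens(new_terms)

# Appends randomly initialized rows to the embedding matrix; continued
# pre-training on the renal corpus then learns useful representations.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```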

  • Article type: Journal Article
    Transformer-based approaches to solving natural language processing (NLP) tasks, such as BERT and GPT, are gaining popularity due to their ability to achieve high performance. These approaches benefit from using enormous data sizes to create pre-trained models and from the ability to understand the context of words in a sentence. Their use in the information retrieval domain is thought to increase effectiveness and efficiency. This paper demonstrates a BERT-based method (CASBERT) implementation to build a search tool over data annotated compositely using ontologies. The data is a collection of biosimulation models written using the CellML standard in the Physiome Model Repository (PMR). A biosimulation model structurally consists of basic entities of constants and variables that construct higher-level entities such as components, reactions, and the model. Finding these entities specific to their level is beneficial for various purposes regarding variable reuse, experiment setup, and model audit. Initially, we created embeddings representing compositely annotated entities for constant and variable search (the lowest-level entities). Then, these low-level entity embeddings were vertically and efficiently combined to create higher-level entity embeddings to search for components, models, images, and simulation setups. Our approach is general, so it can be used to create search tools over other data semantically annotated with ontologies (biosimulation models encoded in the SBML format, for example). Our tool is named Biosimulation Model Search Engine (BMSE).
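
    The bottom-up embedding composition described above can be sketched as follows; the encoder checkpoint and annotation strings are illustrative assumptions rather than CASBERT's actual configuration.

```python
# Minimal sketch: average low-level entity embeddings (variables/constants)
# to form a higher-level (component) embedding, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

# Ontology-derived annotations of two variables in one CellML component.
variable_annotations = [
    "concentration of cytosolic calcium",
    "rate constant of calcium release",
]
variable_embeddings = encoder.encode(variable_annotations)      # (2, dim)
component_embedding = variable_embeddings.mean(axis=0)          # higher level

query = encoder.encode(["calcium dynamics"])[0]
cosine = float(np.dot(component_embedding, query) /
               (np.linalg.norm(component_embedding) * np.linalg.norm(query)))
print(f"component-query cosine similarity: {cosine:.3f}")
```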

  • Article type: Journal Article
    Scarcity of labels for medical images is a significant barrier to training representation learning approaches based on deep neural networks. This limitation is also present when using imaging data collected during routine clinical care and stored in picture archiving and communication systems (PACS), as these data rarely have the high-quality labels required for medical image computing tasks attached. However, medical images extracted from PACS are commonly coupled with descriptive radiology reports that contain significant information and could be leveraged to pre-train imaging models, which could then serve as starting points for further task-specific fine-tuning. In this work, we perform a head-to-head comparison of three different self-supervised strategies to pre-train the same imaging model on 3D brain computed tomography angiogram (CTA) images, with large vessel occlusion (LVO) detection as the downstream task. These strategies evaluate two natural language processing (NLP) approaches, one to extract 100 explicit radiology concepts (Rad-SpatialNet) and the other to create general-purpose radiology report embeddings (DistilBERT). In addition, we experiment with learning radiology concepts directly or by using a recent self-supervised learning approach (CLIP) that learns by ranking the distance between language and image vector embeddings. The LVO detection task was selected because it requires 3D imaging data, is clinically important, and requires the algorithm to learn outputs not explicitly stated in the radiology report. Pre-training was performed on an unlabeled dataset containing 1,542 3D CTA-report pairs. The downstream task was tested on a labeled dataset of 402 subjects. We find that pre-training with CLIP-based strategies improves the performance of the imaging model in detecting LVO compared to a model trained only on the labeled data. The best performance was achieved by pre-training using the explicit radiology concepts together with the CLIP strategy.
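
    A minimal sketch of the CLIP-style contrastive objective mentioned above, with random tensors standing in for the 3D CTA encoder and report encoder outputs; the temperature value is an assumption.

```python
# Minimal sketch: symmetric contrastive loss that pulls matched image-report
# pairs together and pushes mismatched pairs apart.
import torch
import torch.nn.functional as F

batch, dim = 8, 128
image_features = F.normalize(torch.randn(batch, dim), dim=-1)  # CTA encoder out
text_features = F.normalize(torch.randn(batch, dim), dim=-1)   # report encoder out

temperature = 0.07  # assumed; CLIP learns this scale during training
logits = image_features @ text_features.t() / temperature       # (batch, batch)

# The i-th image should rank its own report highest, and vice versa.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(f"contrastive loss: {loss.item():.3f}")
```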

  • Article type: Journal Article
    Understanding the diagnostic goal of medical reports is valuable information for understanding patient flows. This work focuses on extracting the reason for taking an MRI scan of multiple sclerosis (MS) patients, using the attached free-form reports: Diagnosis, Progression, or Monitoring. We investigate the performance of domain-dependent and general state-of-the-art language models and their alignment with domain expertise. To this end, eXplainable Artificial Intelligence (XAI) techniques are used to acquire insight into the inner workings of the model, and these insights are verified for trustworthiness. The verified XAI explanations are then compared with explanations from a domain expert to indirectly determine the reliability of the model. BERTje, a Dutch Bidirectional Encoder Representations from Transformers (BERT) model, outperforms RobBERT and MedRoBERTa.nl in both accuracy and reliability. The latter (MedRoBERTa.nl) is a domain-specific model, while BERTje is a generic model, showing that domain-specific models are not always superior. Our validation of BERTje in a small prospective study shows promising results for the potential uptake of the model in a practical setting.
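
    As a sketch only, the report-labelling task above can be framed as three-way sequence classification over the public BERTje checkpoint; the model ID and example sentence are assumptions, and the classification head is random until fine-tuned.

```python
# Minimal sketch: BERTje as a three-way classifier over MRI report text.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["Diagnosis", "Progression", "Monitoring"]
checkpoint = "GroNLP/bert-base-dutch-cased"  # public BERTje checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels))

report = "MRI hersenen ter evaluatie van ziekteprogressie bij MS."  # made-up
inputs = tokenizer(report, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(-1).item()])  # meaningful only after fine-tuning
```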

  • Article type: Journal Article
    BACKGROUND: Medical notes are narratives that describe the health of the patient in free-text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk of developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions.
    METHODS: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select keywords with the highest discriminative ability between at-risk dementia patients and healthy controls. The medical notes are then summarized in the form of occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding consists of the average of the vector embeddings of each keyword occurrence as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to UMLS concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding, and a gradient boosted trees model is applied to the second encoding. Patients from a single health care institution are used to develop all the classifiers, which are then evaluated on held-out patients from the same institution as well as test patients from two other health care institutions.
    RESULTS: The results indicate that it is possible to identify patients at risk for dementia one year ahead of disease onset using medical notes, with an AUC of 75%, when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space or when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included.
    CONCLUSIONS: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes, the availability of sufficient training data for each disease condition, and the variability resulting from different feature engineering techniques.
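
    A hedged reconstruction of the TF-ICF keyword-selection idea in METHODS: score terms by their frequency within a class, discounted by the number of classes they appear in. The formulation and toy notes below are assumptions; the study's exact definition is not given in the abstract.

```python
# Minimal sketch: TF-ICF = term frequency x log(inverse class frequency).
import math
from collections import Counter

classes = {  # toy stand-ins for at-risk vs. control note collections
    "at_risk": ["memory loss noted", "patient forgets medications", "confusion reported"],
    "control": ["routine visit", "blood pressure stable", "no acute complaints"],
}

class_freq = Counter()                       # classes in which a term occurs
tf = {c: Counter() for c in classes}
for c, notes in classes.items():
    seen = set()
    for note in notes:
        for term in note.split():
            tf[c][term] += 1
            seen.add(term)
    class_freq.update(seen)

n = len(classes)
tficf = {c: {t: f * math.log(n / class_freq[t]) for t, f in counts.items()}
         for c, counts in tf.items()}
print(sorted(tficf["at_risk"].items(), key=lambda kv: -kv[1])[:5])
```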

  • Article type: Journal Article
    The recent success of deep learning neural language models such as Bidirectional Encoder Representations from Transformers (BERT) has brought innovations to computational language research. The present study explores the possibility of using a language model to investigate human language processing, based on a case study of negative polarity items (NPIs). We first conducted an experiment with BERT to examine whether the model successfully captures the hierarchical structural relationship between an NPI and its licensor, and whether it may produce an error analogous to the grammatical illusion shown in psycholinguistic experiments (Experiment 1). We also investigated whether the language model can capture the fine-grained semantic properties of NPI licensors and discriminate their subtle differences on the scale of licensing strength (Experiment 2). The results of the two experiments suggest that, overall, the neural language model is highly sensitive to both syntactic and semantic constraints in NPI processing. The model's processing patterns and sensitivities are shown to be very close to those of humans, suggesting the role of such models as research tools or objects in the study of language.
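
    Probing paradigms like Experiment 1 are commonly run by comparing masked-token probabilities; the sketch below assumes the public bert-base-uncased checkpoint and made-up stimuli, not the study's materials.

```python
# Minimal sketch: probability BERT assigns to the NPI "ever" with and
# without a licensing negative context.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def npi_probability(sentence: str, npi: str = "ever") -> float:
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return logits.softmax(-1)[tokenizer.convert_tokens_to_ids(npi)].item()

licensed = npi_probability("No author has [MASK] received the award.")
unlicensed = npi_probability("The author has [MASK] received the award.")
print(f"licensed: {licensed:.4f}  unlicensed: {unlicensed:.4f}")
```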

  • Article type: Journal Article
    The Internet of Things is a paradigm that interconnects several smart devices through the internet to provide ubiquitous services to users. This paradigm and Web 2.0 platforms generate countless amounts of textual data. Thus, a significant challenge in this context is automatically performing text classification. State-of-the-art outcomes have recently been obtained with language models trained from scratch on corpora made up of online news to handle text classification better. One language model that we can highlight is BERT (Bidirectional Encoder Representations from Transformers); DistilBERT is a pre-trained, smaller, general-purpose language representation model. In this context, through a case study, we propose performing the text classification task with the two previously mentioned models for two languages (English and Brazilian Portuguese) on different datasets. The results show that DistilBERT's training time for English and Brazilian Portuguese was about 45% faster than that of its larger counterpart; it is also 40% smaller and preserves about 96% of language comprehension skills on balanced datasets.
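
    The size comparison reported above can be spot-checked with the public base checkpoints; exact ratios depend on the checkpoints used, which the abstract does not name.

```python
# Minimal sketch: compare parameter counts of BERT base and DistilBERT base.
from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```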

  • Article type: Journal Article
    Building a document-level classifier for COVID-19 on radiology reports could help assist providers in their daily clinical routine, as well as create large numbers of labels for computer vision models. We have developed such a classifier by fine-tuning a BERT-like model initialized from RadBERT, a model continuously pre-trained on radiology reports that can be used for all radiology-related tasks. RadBERT outperforms all biomedical pre-trainings on this COVID-19 task (P<0.01) and helps our fine-tuned model achieve an 88.9 macro-averaged F1-score when evaluated on both X-ray and CT reports. To build this model, we rely on a multi-institutional dataset re-sampled and enriched with concurrent lung diseases, helping the model resist distribution shifts. In addition, we explore a variety of fine-tuning and hyperparameter optimization techniques that accelerate fine-tuning convergence, stabilize performance, and improve accuracy, especially when data or computational resources are limited. Finally, we provide a set of visualization tools and explainability methods to better understand the performance of the model and to support its practical use in the clinical setting. Our approach offers a ready-to-use COVID-19 classifier that can be applied similarly to other radiology report classification tasks.
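
    For reference, the macro-averaged F1 metric reported above averages per-class F1 scores so minority classes count equally; the labels below are illustrative stand-ins.

```python
# Minimal sketch: macro-averaged F1 over report-level class labels.
from sklearn.metrics import f1_score

y_true = ["covid", "covid", "normal", "other", "normal", "covid"]
y_pred = ["covid", "normal", "normal", "other", "normal", "covid"]
print(f"macro-F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
```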