Data imbalance

  • Article type: Journal Article
    Objective. This study aims to address the challenges of imbalanced heartbeat classification using electrocardiogram (ECG). In this proposed novel deep-learning method, the focus is on accurately identifying minority classes in conditions characterized by significant imbalances in ECG data. Approach. We propose a feature fusion neural network enhanced by a dynamic minority-biased batch weighting loss function. This network comprises three specialized branches: the complete ECG data branch for a comprehensive view of ECG signals, the local QRS wave branch for detailed features of the QRS complex, and the R wave information branch to analyze R wave characteristics. This structure is designed to extract diverse aspects of ECG data. The dynamic loss function prioritizes minority classes while maintaining the recognition of majority classes, adjusting the network's learning focus without altering the original data distribution. Together, this fusion structure and adaptive loss function significantly improve the network's ability to distinguish between various heartbeat classes, enhancing the accuracy of minority class identification. Main results. The proposed method demonstrated balanced performance on the MIT-BIH dataset, especially for minority classes. Under the intra-patient paradigm, the accuracy, sensitivity, specificity, and positive predictive value for Supraventricular ectopic beat were 99.63%, 93.62%, 99.81%, and 92.98%, respectively, and for Fusion beat were 99.76%, 85.56%, 99.87%, and 84.16%, respectively. Under the inter-patient paradigm, these metrics were 96.56%, 89.16%, 96.84%, and 51.99% for Supraventricular ectopic beat, and 96.10%, 77.06%, 96.25%, and 13.92% for Fusion beat, respectively. Significance. This method effectively addresses the class imbalance in ECG datasets. By leveraging diverse ECG signal information and a novel loss function, this approach offers a promising tool for aiding in the diagnosis and treatment of cardiac conditions.
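
    The abstract does not give the exact form of the dynamic minority-biased batch weighting loss, but its stated behavior (favoring minority classes per batch without resampling) can be illustrated with per-batch inverse-frequency class weights. A minimal PyTorch sketch, with the weighting rule as an assumption rather than the authors' formula:

        import torch
        import torch.nn.functional as F

        def minority_biased_ce(logits, targets, num_classes, eps=1.0):
            # Recompute class weights from this batch's label counts, so the
            # loss adapts to each batch's class mix without resampling data.
            counts = torch.bincount(targets, minlength=num_classes).float()
            weights = (counts.sum() + eps) / (counts + eps)  # rarer class -> larger weight
            weights = weights / weights.sum() * num_classes  # keep weights near 1 on average
            return F.cross_entropy(logits, targets, weight=weights)

        # usage: loss = minority_biased_ce(model(batch_x), batch_y, num_classes=5)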

  • Article type: Journal Article
    Accurately identifying potential off-target sites in the CRISPR/Cas9 system is crucial for improving the efficiency and safety of editing. However, the imbalance of available off-target datasets has posed a major obstacle to enhancing prediction performance. Although several prediction models have been developed to address this issue, there remains a lack of systematic research on handling data imbalance in off-target prediction. This article systematically investigates the data imbalance issue in off-target datasets and explores numerous methods to handle data imbalance from a novel perspective. First, we highlight the impact of the imbalance problem on off-target prediction tasks by determining the imbalance ratios present in these datasets. Then, we provide a comprehensive review of various sampling techniques and cost-sensitive methods to mitigate class imbalance in off-target datasets. Finally, systematic experiments are conducted on several state-of-the-art prediction models to illustrate the impact of applying data imbalance solutions. The results show that class imbalance processing methods significantly improve the off-target prediction capabilities of the models across multiple testing datasets. The code and datasets used in this study are available at https://github.com/gzrgzx/CRISPR_Data_Imbalance.
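
    As a concrete illustration of the two families of remedies the article reviews, the sketch below computes a dataset's imbalance ratio and then applies either random oversampling (via the imbalanced-learn package) or "balanced" class weights for a cost-sensitive loss. The toy arrays are placeholders, not CRISPR data:

        import numpy as np
        from sklearn.utils.class_weight import compute_class_weight
        from imblearn.over_sampling import RandomOverSampler

        def imbalance_ratio(y):
            # Majority-class count divided by minority-class count.
            counts = np.bincount(y)
            counts = counts[counts > 0]
            return counts.max() / counts.min()

        y = np.array([0] * 950 + [1] * 50)   # toy labels: IR = 19
        X = np.random.rand(len(y), 8)        # toy feature matrix

        print(f"imbalance ratio: {imbalance_ratio(y):.1f}")

        # Sampling route: rebalance the training set itself.
        X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

        # Cost-sensitive route: keep the data, reweight the loss instead.
        weights = compute_class_weight("balanced", classes=np.unique(y), y=y)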

  • Article type: Journal Article
    Despite the widespread use of ionizable lipid nanoparticles (LNPs) in clinical applications for messenger RNA (mRNA) delivery, mRNA drug delivery systems face an efficiency challenge in the screening of LNPs. Traditional screening methods often require a substantial amount of experimental time and incur high research and development costs. To accelerate the early development stage of LNPs, we propose TransLNP, a transformer-based transfection prediction model designed to aid in the selection of LNPs for mRNA drug delivery systems. TransLNP uses two types of molecular information to perceive the relationship between structure and transfection efficiency: coarse-grained atomic sequence information and fine-grained atomic spatial relationship information. Due to the scarcity of existing LNP experimental data, we find that pretraining the molecular model is crucial for better understanding the task of predicting LNP properties, which is achieved through reconstructing atomic 3D coordinates and masked-atom prediction. In addition, the issue of data imbalance is particularly prominent in the real-world exploration of LNPs. We introduce the BalMol block to solve this problem by smoothing the distributions of labels and molecular features. Our approach outperforms state-of-the-art works in transfection property prediction under both random and scaffold data splitting. Additionally, we establish a relationship between molecular structural similarity and transfection differences, selecting 4267 pairs of molecular transfection cliffs, which are pairs of molecules that exhibit high structural similarity but significant differences in transfection efficiency, thereby revealing the primary source of prediction errors. The code, model and data are made publicly available at https://github.com/wklix/TransLNP.
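
    The abstract says only that BalMol smooths the distributions of labels and molecular features. For the label side, a standard technique with that effect is label distribution smoothing (LDS) for imbalanced regression; the sketch below reweights samples by the inverse of a Gaussian-smoothed label density. It is an assumption about the flavor of BalMol, not its actual implementation:

        import numpy as np
        from scipy.ndimage import gaussian_filter1d

        def lds_weights(labels, num_bins=50, sigma=2.0):
            # Smooth the empirical label histogram with a Gaussian kernel,
            # then weight each sample by the inverse smoothed density of its
            # bin, so rare transfection values count more in the loss.
            hist, edges = np.histogram(labels, bins=num_bins)
            smoothed = gaussian_filter1d(hist.astype(float), sigma=sigma)
            bin_idx = np.clip(np.digitize(labels, edges[1:-1]), 0, num_bins - 1)
            w = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)
            return w * len(w) / w.sum()  # normalize to mean 1

        # usage: per-sample weights multiplied into a regression loss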

  • Article type: Journal Article
    BACKGROUND: In the last decade, long-tail learning has become a popular research focus in deep learning applications in medicine. However, no scientometric reports have provided a systematic overview of this scientific field. We utilized bibliometric techniques to identify and analyze the literature on long-tailed learning in deep learning applications in medicine and investigate research trends, core authors, and core journals. We expanded our understanding of the primary components and principal methodologies of long-tail learning research in the medical field.
    METHODS: Web of Science was utilized to collect all articles on long-tailed learning in medicine published until December 2023. The suitability of all retrieved titles and abstracts was evaluated. For bibliometric analysis, all numerical data were extracted. CiteSpace was used to create clustered and visual knowledge graphs based on keywords.
    RESULTS: A total of 579 articles met the evaluation criteria. Over the last decade, the annual number of publications and citation frequency both showed significant growth, following a power-law and exponential trend, respectively. Noteworthy contributors to this field include Husanbir Singh Pannu, Fadi Thabtah, and Talha Mahboob Alam, while leading journals such as IEEE ACCESS, COMPUTERS IN BIOLOGY AND MEDICINE, IEEE TRANSACTIONS ON MEDICAL IMAGING, and COMPUTERIZED MEDICAL IMAGING AND GRAPHICS have emerged as pivotal platforms for disseminating research in this area. The core of long-tailed learning research within the medical domain is encapsulated in six principal themes: deep learning for imbalanced data, model optimization, neural networks in image analysis, data imbalance in health records, CNN in diagnostics and risk assessment, and genetic information in disease mechanisms.
    CONCLUSIONS: This study summarizes recent advancements in applying long-tail learning to deep learning in medicine through bibliometric analysis and visual knowledge graphs. It identifies new trends, sources, core authors, journals, and research hotspots. This field has shown great promise in medical deep learning research, and our findings provide pertinent and valuable insights for future research and clinical practice.

  • Article type: Journal Article
    BACKGROUND: Medical image registration plays an important role in several applications. Existing approaches using unsupervised learning encounter issues due to the data imbalance problem, as their target is usually a continuous variable.
    OBJECTIVE: In this study, we introduce a novel approach known as Unsupervised Imbalanced Registration, to address the challenge of data imbalance and prevent overconfidence while increasing the accuracy and stability of 4D image registration.
    METHODS: Our approach involves performing unsupervised image mixtures to smooth the input space, followed by unsupervised image registration to learn the continuous target. We evaluated our method on 4D-Lung using two widely used unsupervised methods, namely VoxelMorph and ViT-V-Net.
    RESULTS: Our findings demonstrate that our proposed method significantly enhances the mean accuracy of registration by 3%-10% on a small dataset while also reducing the accuracy variance by 10%.
    CONCLUSIONS: Unsupervised Imbalanced Registration is a promising approach that is compatible with current unsupervised image registration methods applied to 4D images.
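
    "Unsupervised image mixtures to smooth the input space" is consistent with a mixup-style convex blend of image pairs. The sketch below shows that generic operation under the assumption of Beta-distributed mixing, without claiming it matches the authors' exact scheme:

        import numpy as np

        def mix_images(img_a, img_b, alpha=0.2, rng=None):
            # Mixup-style blend: lambda ~ Beta(alpha, alpha) interpolates
            # between the two images, smoothing the input space the
            # registration network sees.
            rng = rng or np.random.default_rng()
            lam = rng.beta(alpha, alpha)
            return lam * img_a + (1.0 - lam) * img_b, lam

        # usage: mixed, lam = mix_images(volume_t0, volume_t1)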

  • Article type: Journal Article
    BACKGROUND: After the COVID-19 pandemic, the conflict between limited mental health care resources and the rapidly growing number of patients has become more pronounced. It is necessary for psychologists to adopt artificial intelligence (AI)-based methods to analyze the satisfaction with drug treatment of patients undergoing treatment for mental illness.
    OBJECTIVE: Our goal was to construct highly accurate and transferable models for predicting the medication satisfaction of patients with mental illness by analyzing their own experiences and comments related to medication intake.
    METHODS: We extracted 41,851 reviews in 20 categories of disorders related to mental illnesses from a large public data set of 161,297 reviews in 16,950 illness categories. To discover a more optimal structure for the natural language processing models, we proposed the Unified Interchangeable Model Fusion to decompose the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT), support vector machine, and random forest (RF) models into 2 modules, the encoder and the classifier, and then reconstruct fused "encoder+classifier" models to accurately evaluate patients' satisfaction. The fused models were divided into 2 categories in terms of model structure: traditional machine learning-based models and neural network-based models. A new loss function was proposed for the neural network-based models to overcome overfitting and data imbalance. Finally, we fine-tuned the fused models and evaluated their performance comprehensively in terms of F1-score, accuracy, κ coefficient, and training time using 10-fold cross-validation.
    RESULTS: Through extensive experiments, the transformer bidirectional encoder+RF model outperformed the state-of-the-art BERT, MentalBERT, and other fused models. It became the optimal model for predicting the patients\' satisfaction with drug treatment. It achieved an average graded F1-score of 0.872, an accuracy of 0.873, and a κ coefficient of 0.806. This model is suitable for high-standard users with sufficient computing resources. Alternatively, it turned out that the word-embedding encoder+RF model showed relatively good performance with an average graded F1-score of 0.801, an accuracy of 0.812, and a κ coefficient of 0.695 but with much less training time. It can be deployed in environments with limited computing resources.
    CONCLUSIONS: We analyzed the performance of support vector machine, RF, BERT, MentalBERT, and all fused models and identified the optimal models for different clinical scenarios. The findings can serve as evidence to support that the natural language processing methods can effectively assist psychologists in evaluating the satisfaction of patients with drug treatment programs and provide precise and standardized solutions. The Unified Interchangeable Model Fusion provides a different perspective on building AI models in mental health and has the potential to fuse the strengths of different components of the models into a single model, which may contribute to the development of AI in mental health.
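
    The "encoder+classifier" decomposition can be sketched generically: freeze a BERT encoder as a feature extractor and train an RF classifier on its pooled embeddings. The checkpoint name and mean pooling below are illustrative choices, not the authors' fine-tuned setup:

        import torch
        from transformers import AutoTokenizer, AutoModel
        from sklearn.ensemble import RandomForestClassifier

        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        enc = AutoModel.from_pretrained("bert-base-uncased").eval()

        @torch.no_grad()
        def embed(texts):
            # Mean-pooled token embeddings serve as fixed features for the RF.
            batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
            hidden = enc(**batch).last_hidden_state       # (B, T, 768)
            mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
            return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

        # Swap the classifier module in behind the frozen encoder.
        rf = RandomForestClassifier(n_estimators=300, class_weight="balanced")
        # rf.fit(embed(train_texts), train_labels)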

  • Article type: Journal Article
    To address the scarcity and class imbalance of abnormal electrocardiogram (ECG) databases, which are crucial for AI-driven diagnostic tools for potential cardiovascular disease detection, this study proposes a novel quantum conditional generative adversarial algorithm (QCGAN-ECG) for generating abnormal ECG signals. QCGAN-ECG constructs a quantum generator based on the patch method, in which each sub-generator generates distinct features of abnormal heartbeats in different segments. This patch-based generative algorithm conserves quantum resources and makes QCGAN-ECG practical for near-term quantum devices. Additionally, QCGAN-ECG introduces quantum registers as control conditions. It encodes information about the types and probability distributions of abnormal heartbeats into quantum registers, rendering the entire generative process controllable. Simulation experiments on PennyLane demonstrated that QCGAN-ECG can generate entirely abnormal heartbeats with an average accuracy of 88.8%. Moreover, QCGAN-ECG can accurately fit the probability distribution of various abnormal ECG data. In anti-noise experiments, QCGAN-ECG showcased outstanding robustness across various levels of quantum noise interference. These results demonstrate the effectiveness and potential applicability of QCGAN-ECG for generating abnormal ECG signals, which will further promote the development of AI-driven cardiac disease diagnosis systems. The source code is available at github.com/VanSWK/QCGAN_ECG.
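
    The patch method builds the generator from several small sub-circuits, each emitting one segment of the signal. A minimal PennyLane sketch of that structure (a plain RY/CNOT ansatz; the conditional quantum registers are omitted, and all circuit details are assumptions rather than the paper's design):

        import pennylane as qml
        from pennylane import numpy as np

        n_qubits, n_layers, n_patches = 4, 2, 4
        dev = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(dev)
        def sub_generator(noise, weights):
            # Encode noise, then apply an entangling ansatz; the measured
            # probabilities become one segment of the generated heartbeat.
            for i in range(n_qubits):
                qml.RY(noise[i], wires=i)
            for layer in weights:
                for i in range(n_qubits):
                    qml.RY(layer[i], wires=i)
                for i in range(n_qubits - 1):
                    qml.CNOT(wires=[i, i + 1])
            return qml.probs(wires=range(n_qubits))

        noise = np.random.uniform(0, np.pi, n_qubits)
        patch_weights = [np.random.uniform(0, np.pi, (n_layers, n_qubits))
                         for _ in range(n_patches)]
        # Each sub-generator has its own weights and yields a 2**n_qubits segment.
        sample = np.concatenate([sub_generator(noise, w) for w in patch_weights])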

  • Article type: Journal Article
    Large hospitals can be complex, with numerous discipline and subspecialty settings. Patients may have limited medical knowledge, making it difficult for them to determine which department to visit. As a result, visits to the wrong departments and unnecessary appointments are common. To address this issue, modern hospitals require a remote system capable of performing intelligent triage, enabling patients to perform self-service triage. To meet these challenges, this study presents an intelligent triage system based on transfer learning, capable of processing multilabel neurological medical texts. The system predicts a diagnosis and corresponding department based on the patient's input. It utilizes the triage priority (TP) method to label diagnostic combinations found in medical records, converting a multilabel problem into a single-label one. The system considers disease severity and reduces the "class overlapping" of the dataset. The BERT model classifies the chief complaint text, predicting a primary diagnosis corresponding to the complaint. To address data imbalance, a composite loss function based on cost-sensitive learning is added to the BERT architecture. The study results indicate that the TP method achieves a classification accuracy of 87.47% on medical record text, outperforming other problem transformation methods. By incorporating the composite loss function, the system's accuracy improves to 88.38%, surpassing other loss functions. Compared to traditional methods, this system does not introduce significant complexity, yet it substantially improves triage accuracy, reduces patient input confusion, and enhances hospital triage capabilities, ultimately improving the patient's medical experience. The findings could provide a reference for intelligent triage development.
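
    The exact composite loss is not specified in the abstract; one plausible cost-sensitive composition blends class-weighted cross-entropy with a focal term, as sketched below (the blend and the class_costs vector are assumptions for illustration):

        import torch
        import torch.nn.functional as F

        def composite_loss(logits, targets, class_costs, gamma=2.0, mix=0.5):
            # Class-weighted CE encodes misclassification costs; the focal
            # term additionally down-weights easy, well-classified examples.
            ce = F.cross_entropy(logits, targets, weight=class_costs, reduction="none")
            pt = torch.softmax(logits, 1).gather(1, targets.unsqueeze(1)).squeeze(1)
            focal = (1.0 - pt) ** gamma * ce
            return (mix * ce + (1.0 - mix) * focal).mean()

        # usage: loss = composite_loss(bert_logits, labels, class_costs=cost_vec)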

  • Article type: Journal Article
    Road crashes are a major problem for traffic safety management, and they usually cause flash crowd traffic with a profound influence on traffic management and communication systems. In 2020, the sudden outbreak of the novel coronavirus disease (COVID-19) pandemic led to significant changes in road traffic conditions. In this paper, by analyzing crash data from 2016 to 2020 and new COVID-19 case data in 2020, we find that the average crash severity and number of crash deaths during this period (a rapid increase of new COVID-19 cases in 2020) were higher than those in the previous four years. Hence, it is necessary to develop a novel road crash risk prediction model for such emergencies. We propose a novel data-adaptive fatigue focal loss (DA-FFL) method that fuses fatigue factors to establish a road crash risk prediction model under the scenario of large-scale emergencies. Finally, the experimental results demonstrate that DA-FFL performs better than other typical methods in terms of area under the curve (AUC) and false alarm rate (FAR) for imbalanced data. Furthermore, DA-FFL achieves better prediction performance with convolutional neural network-long short-term memory (CNN-LSTM) architectures.
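
    The abstract does not define the fatigue factor, so the sketch below stops at the standard binary focal loss that DA-FFL presumably extends; fusing a data-adaptive fatigue term into alpha or gamma would be the paper's contribution:

        import torch

        def focal_loss(p, y, alpha=0.25, gamma=2.0):
            # Standard binary focal loss: down-weights easy examples so the
            # rare crash class dominates the gradient.
            p = p.clamp(1e-7, 1 - 1e-7)
            pt = torch.where(y == 1, p, 1 - p)
            at = torch.where(y == 1, torch.full_like(p, alpha),
                             torch.full_like(p, 1 - alpha))
            return (-at * (1 - pt) ** gamma * torch.log(pt)).mean()

        # usage: loss = focal_loss(model(x).sigmoid(), crash_labels.float())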

  • Article type: Journal Article
    The extraction of drug-drug interactions (DDIs) is an important task in the field of biomedical research, which can reduce unexpected health risks during patient treatment. Previous work indicates that methods using external drug information have a much higher performance than those methods not using it. However, the use of external drug information is time-consuming and resource-costly. In this work, we propose a novel method for extracting DDIs which does not use external drug information, but still achieves comparable performance. First, we no longer convert the drug name to standard tokens such as DRUG0, the method commonly used in previous research. Instead, full drug names with drug entity marking are input to BioBERT, allowing us to enhance the selected drug entity pair. Second, we adopt the Key Semantic Sentence approach to emphasize the words closely related to the DDI relation of the selected drug pair. After the above steps, the misclassification of similar instances which are created from the same sentence but corresponding to different pairs of drug entities can be significantly reduced. Then, we employ the Gradient Harmonizing Mechanism (GHM) loss to reduce the weight of mislabeled instances and easy-to-classify instances, both of which can lead to poor performance in DDI extraction. Overall, we demonstrate in this work that it is better not to use drug blinding with BioBERT, and show that GHM performs better than Cross-Entropy loss if the proportion of label noise is less than 30%. The proposed model achieves state-of-the-art results with an F1-score of 84.13% on the DDIExtraction 2013 corpus (a standard English DDI corpus), which fills the performance gap (4%) between methods that rely on and do not rely on external drug information.
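
    GHM is a published technique (Li et al., AAAI 2019): examples are binned by gradient norm g = |sigmoid(x) - y| and reweighted inversely to their bin's population, which suppresses both trivially easy examples and outlier-like ones that often carry label noise. A compact binary-classification sketch (the bin count is a free choice, not the paper's setting):

        import torch
        import torch.nn.functional as F

        def ghm_c_loss(logits, targets, bins=10):
            # Gradient norm per example; detached so weighting is not trained.
            g = (torch.sigmoid(logits).detach() - targets).abs()
            edges = torch.linspace(0, 1, bins + 1, device=g.device)
            weights = torch.zeros_like(g)
            n, valid = g.numel(), 0
            for i in range(bins):
                lo, hi = edges[i], edges[i + 1]
                in_bin = (g >= lo) & ((g < hi) if i < bins - 1 else (g <= hi))
                cnt = in_bin.sum().item()
                if cnt > 0:                  # dense bins get small weights
                    weights[in_bin] = n / cnt
                    valid += 1
            weights = weights / max(valid, 1)
            return F.binary_cross_entropy_with_logits(
                logits, targets, weight=weights, reduction="mean")

        # usage: loss = ghm_c_loss(pair_logits, ddi_labels.float())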