Keywords: BERT; Clinical text; De-identification; Electronic health records; Language models; Natural language processing; Privacy preservation; Pseudonymization; Swedish

MeSH: Natural Language Processing; Humans; Privacy; Sweden; Anonyms and Pseudonyms; Computer Security / standards; Confidentiality / standards; Electronic Health Records / standards

Source: DOI: 10.1186/s12911-024-02546-8 | PDF (PubMed)

Abstract:
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models contain large numbers of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are highly sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies sensitive entities and replaces them with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects of end-to-end pseudonymization on the predictive performance of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of both pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
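The pseudonymization step described in the abstract, detecting sensitive entities and swapping them for realistic surrogates, can be illustrated with a minimal sketch. Note that the paper's actual pipeline uses a trained NER-based de-identifier for Swedish clinical text; the regex patterns, entity types, and surrogate lists below are hypothetical stand-ins for illustration only.

```python
import random
import re

# Hypothetical surrogate pool; a real pipeline would sample from large,
# demographically plausible lists of Swedish names.
SURROGATE_NAMES = ["Anna Lindgren", "Erik Johansson", "Maria Berg"]

# Toy regex patterns standing in for a trained NER de-identifier; real
# systems detect many more entity types (names, dates, locations,
# phone numbers, ID numbers, ...).
NAME_PATTERN = re.compile(r"\b(?:Sven Svensson|Karin Nilsson)\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def pseudonymize(text: str) -> str:
    """Replace detected sensitive entities with realistic surrogates,
    keeping the name mapping consistent within one document."""
    mapping: dict[str, str] = {}

    def name_surrogate(match: re.Match) -> str:
        original = match.group(0)
        # Reuse the same surrogate for repeated mentions of one person,
        # so the text stays internally coherent.
        if original not in mapping:
            mapping[original] = random.choice(SURROGATE_NAMES)
        return mapping[original]

    def date_surrogate(match: re.Match) -> str:
        # Resample a plausible but unrelated date for each date mention.
        return f"20{random.randint(10, 22)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"

    text = NAME_PATTERN.sub(name_surrogate, text)
    text = DATE_PATTERN.sub(date_surrogate, text)
    return text


record = "Sven Svensson was admitted 2021-04-15. Sven Svensson was discharged 2021-04-18."
print(pseudonymize(record))
```

In the study, this kind of replacement is applied end to end: both the corpus used to pre-train the BERT models and the annotated data used to fine-tune them for the five clinical tasks are pseudonymized before training.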