Keywords: anonymization; anonymized; confidentiality; data science; data sharing; deidentification; identification; medical informatics; privacy; privacy-enhancing technologies; privacy-utility trade-off; security

MeSH: Humans; Data Anonymization; Renal Insufficiency, Chronic / therapy; Information Dissemination / methods; Algorithms; Germany; Confidentiality; Privacy

Source: DOI:10.2196/49445

Abstract:
Background: Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed by applying anonymization algorithms, whereby data are altered so that they can no longer reasonably be related to a person. Yet, such alterations can influence the data set's statistical properties, so the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice.
Objective: The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study.
Methods: The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case-specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results.
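The interplay of risk threshold, generalization, and suppression described in the methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: it assumes a record's reidentification risk is 1 divided by the size of its equivalence class (the records sharing the same quasi-identifier values), it generalizes a hypothetical `age` field into fixed-width bands, and it suppresses records by dropping them. The field names, the band width, and the function names are invented for illustration.

```python
from collections import Counter


def generalize_age(age, width):
    """Replace an exact age by a band such as '40-49' (generalization)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def anonymize(records, quasi_identifiers, threshold, age_width=10):
    """Generalize 'age', then drop records whose assumed risk exceeds threshold.

    Assumption: risk of a record = 1 / size of its equivalence class.
    Returns (kept_records, number_of_suppressed_records).
    """
    generalized = []
    for record in records:
        copy = dict(record)
        copy["age"] = generalize_age(record["age"], age_width)
        generalized.append(copy)

    def class_key(record):
        return tuple(record[q] for q in quasi_identifiers)

    class_sizes = Counter(class_key(r) for r in generalized)
    kept = [r for r in generalized
            if 1.0 / class_sizes[class_key(r)] <= threshold]
    return kept, len(generalized) - len(kept)


# Hypothetical toy data; not taken from the GCKD study.
toy_records = [
    {"age": 41, "sex": "F"},
    {"age": 45, "sex": "F"},
    {"age": 47, "sex": "F"},
    {"age": 62, "sex": "M"},
]
kept, suppressed = anonymize(toy_records, ["age", "sex"], threshold=0.5)
```

Under this assumed risk model, a threshold of 1 lets every record pass, while a threshold of 0.02 requires equivalence classes of at least 50 records, which is why a stricter threshold forces coarser generalization or more suppression.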
Results: Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics. For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap of more than 50%. The use case-specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy.
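To make the 95% CI overlap figures concrete, the following sketch computes one common interval-overlap measure: the length of the intersection of the two intervals, divided by each interval's own length, then averaged. The abstract does not spell out the exact formula, so this definition and the function name `ci_overlap` are assumptions used for illustration only.

```python
def ci_overlap(original, anonymized):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) tuple. Returns a value in [0, 1]:
    1.0 for identical intervals, 0.0 for nonoverlapping intervals.
    Assumed definition; the study's exact formula may differ.
    """
    orig_lo, orig_hi = original
    anon_lo, anon_hi = anonymized
    if orig_hi <= orig_lo or anon_hi <= anon_lo:
        raise ValueError("each interval must have positive length")

    # Length of the intersection; nonpositive means the intervals are disjoint.
    intersection = min(orig_hi, anon_hi) - max(orig_lo, anon_lo)
    if intersection <= 0:
        return 0.0

    share_of_original = intersection / (orig_hi - orig_lo)
    share_of_anonymized = intersection / (anon_hi - anon_lo)
    return 0.5 * (share_of_original + share_of_anonymized)


# Hypothetical intervals, not results from the study.
identical = ci_overlap((1.0, 2.0), (1.0, 2.0))
disjoint = ci_overlap((0.0, 1.0), (2.0, 3.0))
partial = ci_overlap((0.0, 4.0), (1.0, 3.0))
```

With this definition, an estimate whose anonymized interval sits entirely inside the original one still scores below 1 when it is narrower, so the measure penalizes both shifted and distorted intervals.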
Conclusions: Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case-specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data.
Trial Registration: German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971.
International Registered Report Identifier (IRRID): RR2-10.1093/ndt/gfr456.