Keywords: Accuracy; Directly identifiable information; Indirectly identifiable information; Linkage levels

MeSH: Humans; Big Data; Medical Record Linkage; Female; Biomedical Research; Male; Empirical Research

Source: DOI:10.1186/s12911-024-02586-0   PDF (PubMed)

Abstract:
BACKGROUND: Linkage errors, which vary with the linkage level, can adversely affect the accuracy and reliability of analysis results. This empirical study aimed to identify how results differ according to the linkage level of personally identifiable information, the sample size, and the analysis method.
METHODS: Two linkage levels were compared: indirectly identifiable information (III) linkage, based on name, date of birth, and sex, and directly identifiable information (DII) linkage, based on resident registration number. The datasets linked at each level were named databaseIII (DBIII) and databaseDII (DBDII), respectively. Taking the analysis results of the DII-linked dataset as the gold standard, the results of descriptive statistics, group comparison, incidence estimation, treatment effect, and moderation effect analyses were assessed.
RESULTS: The linkage rates for DBDII and DBIII were 71.1% and 99.7%, respectively. For descriptive statistics and group comparison analysis, the difference in effect was "none" to "very little" in most cases. For cervical cancer, which had a relatively small sample size, analysis of DBIII underestimated the incidence in the control group and overestimated the incidence in the treatment group (hazard ratio [HR] = 2.62 [95% confidence interval (CI): 1.63-4.23] in DBIII vs. 1.80 [95% CI: 1.18-2.73] in DBDII). For prostate cancer, the direction of the bias was inconsistent: the treatment effect was over- or underestimated depending on the Surveillance, Epidemiology, and End Results (SEER) summary stage (HR = 2.27 [95% CI: 1.91-2.70] in DBIII vs. 1.92 [95% CI: 1.70-2.17] in DBDII for the localized stage; HR = 1.80 [95% CI: 1.37-2.36] in DBIII vs. 2.05 [95% CI: 1.67-2.52] in DBDII for the regional stage).
CONCLUSIONS: To prevent distortion of analysis results in health and medical research, it is important to check that the patient population and the sample size for each factor of interest (FOI) are sufficient when different data are linked using DBDII. For a rare disease, or when the sample size for an FOI is small, DII linkage is likely to be unavoidable.
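
The two linkage levels described in METHODS can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes exact-match (deterministic) linkage, and the function `link`, the field names (`rrn`, `name`, `dob`, `sex`), and the toy records are all hypothetical. It shows how III linkage can create a link that DII linkage, treated as the gold standard, does not contain.

```python
from collections import defaultdict

def link(left, right, key_fields):
    """Link two record lists on exact equality of key_fields (hypothetical helper)."""
    index = defaultdict(list)
    for rec in right:
        index[tuple(rec[f] for f in key_fields)].append(rec)
    pairs = []
    for rec in left:
        key = tuple(rec[f] for f in key_fields)
        for match in index.get(key, []):
            pairs.append((rec["id"], match["id"]))
    return pairs

# Toy records (invented for illustration): a1 and a2 are different people
# who share name, date of birth, and sex.
cohort = [
    {"id": "a1", "rrn": "R001", "name": "Kim", "dob": "1970-01-01", "sex": "F"},
    {"id": "a2", "rrn": "R002", "name": "Kim", "dob": "1970-01-01", "sex": "F"},
    {"id": "a3", "rrn": "R003", "name": "Lee", "dob": "1982-05-17", "sex": "M"},
]
registry = [
    {"id": "b1", "rrn": "R001", "name": "Kim", "dob": "1970-01-01", "sex": "F"},
    {"id": "b3", "rrn": "R003", "name": "Lee", "dob": "1982-05-17", "sex": "M"},
]

# DII level: link on the unique identifier (assumed gold standard).
dii_pairs = link(cohort, registry, ["rrn"])
# III level: link on name, date of birth, and sex.
iii_pairs = link(cohort, registry, ["name", "dob", "sex"])

# Links produced at the III level that the DII level does not confirm.
false_links = sorted(set(iii_pairs) - set(dii_pairs))
print(false_links)
```

In this toy case the III keys cannot tell a1 and a2 apart, so a2 is wrongly linked to b1. With a small sample for a factor of interest, a handful of such links can shift an estimated incidence noticeably, which is the concern the CONCLUSIONS raise.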