Flexibility of a large blindly synthetized avatar database for occupational research: Example from the CONSTANCES cohort for stroke and knee pain.

Abstract:

OBJECTIVE: Though the rise of big data in occupational health offers new opportunities, especially for cross-cutting research, it raises the issue of data privacy and security, especially when linking sensitive data from insurance, occupational health, or compensation claims. We aimed to validate a large, blindly synthesized database developed from the CONSTANCES cohort by comparing associations between three independently selected outcomes and various exposures in the raw and synthetic data.
METHODS: From the CONSTANCES cohort, a large synthetic dataset was constructed using the avatar method (Octopize), which is agnostic to the primary or secondary uses of the data. Three main analyses of interest were chosen to compare associations between the raw and avatar datasets: risk of stroke (any stroke, and subtypes of stroke), risk of knee pain, and limitations associated with knee pain. Logistic models were computed, and a qualitative comparison of paired odds ratios (ORs) was made.
RESULTS: Both the raw and avatar datasets included 162,434 observations and 19 relevant variables. Of the 172 paired raw/avatar ORs that were computed, including analyses stratified by sex, more than 77% of the comparisons had an OR difference ≤0.5 and less than 7% had a discrepancy in the statistical significance of the associations, with a Cohen's kappa coefficient of 0.80.
CONCLUSIONS: This study shows the flexibility and multiple uses of a synthetic database created with the avatar method in the particular field of occupational health. Such a database can be shared in open access without risking re-identification or privacy issues, and can help bring new insights into complex phenomena such as return to work.
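
The paired raw/avatar comparison summarized in RESULTS can be illustrated with a minimal sketch. It is not the authors' code. It assumes that "OR difference" means the absolute difference between the two odds ratios, and that an association counts as statistically significant when its 95% confidence interval excludes 1; neither definition is spelled out in the abstract. The function names and the four toy pairs are hypothetical and are not values from the study.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share labelled 1 by the first rater
    p_b = sum(b) / n  # share labelled 1 by the second rater
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0  # both raters constant and identical
    return (observed - expected) / (1 - expected)


def is_significant(ci_low, ci_high):
    """Assumption: significant when the 95% CI excludes OR = 1."""
    return ci_low > 1.0 or ci_high < 1.0


def compare_pairs(pairs, tol=0.5):
    """Summarize agreement between paired raw/avatar odds ratios.

    Each pair is ((or_raw, low_raw, high_raw), (or_av, low_av, high_av)).
    """
    n = len(pairs)
    close = sum(abs(raw[0] - av[0]) <= tol for raw, av in pairs)
    sig_raw = [int(is_significant(raw[1], raw[2])) for raw, _ in pairs]
    sig_av = [int(is_significant(av[1], av[2])) for _, av in pairs]
    discordant = sum(r != a for r, a in zip(sig_raw, sig_av))
    return {
        "share_close": close / n,            # |OR difference| <= tol
        "share_discordant": discordant / n,  # significance disagrees
        "kappa": cohen_kappa(sig_raw, sig_av),
    }


# Hypothetical toy pairs (NOT results from the CONSTANCES analysis).
toy_pairs = [
    ((1.80, 1.40, 2.30), (1.70, 1.30, 2.20)),
    ((0.95, 0.80, 1.12), (1.02, 0.85, 1.22)),
    ((2.60, 1.90, 3.50), (1.90, 1.40, 2.60)),
    ((1.25, 1.02, 1.53), (1.18, 0.96, 1.45)),
]

result = compare_pairs(toy_pairs)
print(result)
```

Applied to the study's 172 pairs, the same three summaries would correspond to the reported figures (more than 77% of comparisons within 0.5, less than 7% discordant in significance, kappa of 0.80), provided the definitions assumed above match those used by the authors.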
