BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access such data more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the real data.
OBJECTIVE: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.
METHODS: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of the GCP decomposition, which contains patient factors, using sequential decision trees, copulas, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluated the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling the latent factor matrix of the GCP decomposition associated with patients.
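The pipeline described in the methods (decompose a patient tensor, resample the patient factor matrix, rebuild a synthetic tensor) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 3-way tensor of patients x features x time points, uses an ordinary least-squares CP decomposition fitted by alternating least squares (the Gaussian-loss special case of GCP), and swaps in a simple multivariate normal sampler where the study uses sequential decision trees, copulas, or Hamiltonian Monte Carlo. The function names `cp_als` and `synthesize` and all sizes are hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    R = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, R)

def unfold(X, mode):
    """Mode-n unfolding; remaining modes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=200, seed=0):
    """Least-squares CP decomposition of a 3-way tensor via ALS.
    Assumption: Gaussian loss only, a special case of GCP."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Gram matrix of the Khatri-Rao product (Hadamard of Grams).
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
    return factors

def reconstruct(factors):
    """Rebuild the tensor from patient, feature, and time factors."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def synthesize(factors, n_new, seed=1):
    """Draw new patient factor rows and rebuild a synthetic tensor.
    Assumption: a multivariate normal stands in for the samplers
    named in the text (decision trees, copulas, HMC)."""
    A, B, C = factors
    rng = np.random.default_rng(seed)
    mean = A.mean(axis=0)
    cov = np.cov(A, rowvar=False)
    A_new = rng.multivariate_normal(mean, cov, size=n_new)
    return reconstruct([A_new, B, C])

# Toy data: 30 patients, 4 features, 5 time points, exact rank 2.
rng = np.random.default_rng(42)
true_factors = [rng.standard_normal((d, 2)) for d in (30, 4, 5)]
X = reconstruct(true_factors)

factors = cp_als(X, rank=2)
rel_err = np.linalg.norm(X - reconstruct(factors)) / np.linalg.norm(X)

# The synthetic cohort need not match the original cohort size.
synthetic = synthesize(factors, n_new=50)
```

Only the patient factor matrix (here 30 x 2) is modeled and resampled; the feature and time factors are reused unchanged, so no synthetic record maps one-to-one onto an observed patient record in this sketch.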
RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that are sufficiently similar to the real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, and their number may differ from the number of patients in the original data.
CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.