MeSH: Humans; Longitudinal Studies; Data Analysis; Artificial Intelligence

Source: DOI:10.1038/s41598-024-62102-2

Abstract:
Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data, i.e., data generated through a randomised process that have statistical properties similar to those of the original data, but do not have a one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods, and by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.