Keywords: COVID-19; SARS-CoV-2; data paper; data privacy; de-identification; open data

MeSH: COVID-19 / epidemiology; Centers for Disease Control and Prevention, U.S. / organization & administration, standards; Confidentiality / standards; Data Anonymization / standards; Humans; Pandemics; SARS-CoV-2; United States / epidemiology

Source: DOI:10.1177/00333549211026817

Abstract:
Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.
We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts.
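
The suppression and verification steps described above can be illustrated with a minimal Python sketch. This is not the authors' R code or CDC's actual algorithm: the field names, the `Suppressed` placeholder, and the threshold `k` are all hypothetical choices made for illustration. The sketch masks a record's quasi-identifier fields when fewer than `k` records share that combination of values, then checks that no unmasked small group remains.

```python
from collections import Counter

SUPPRESSED = "Suppressed"  # hypothetical placeholder for a masked value


def suppress_small_cells(records, quasi_identifiers, k=5):
    """Return copies of records with quasi-identifier fields masked
    whenever their combination of values occurs in fewer than k records."""
    def key(rec):
        return tuple(rec[f] for f in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    result = []
    for rec in records:
        masked = dict(rec)  # copy so the input is left unchanged
        if counts[key(rec)] < k:
            for field in quasi_identifiers:
                masked[field] = SUPPRESSED
        result.append(masked)
    return result


def find_violations(records, quasi_identifiers, k=5):
    """Verification step: list value combinations that still occur in
    fewer than k records and are not fully suppressed."""
    counts = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items()
            if n < k and any(v != SUPPRESSED for v in combo)]


# Hypothetical example data: three similar records and one unique record.
records = [{"age_group": "18-49", "county": "A", "sex": "F"}] * 3
records.append({"age_group": "65+", "county": "B", "sex": "M"})

fields = ["age_group", "county"]
released = suppress_small_cells(records, fields, k=3)
remaining = find_violations(released, fields, k=3)
```

In this example the unique record has its `age_group` and `county` masked while `sex` is left alone, and `remaining` comes back empty. A production pipeline would also need to consider which fields count as quasi-identifiers and whether masked records can be re-identified by combining them with other released fields.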
Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com.
Enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allows for improved data use. Automating data-generation procedures improves the volume and timeliness of sharing data.