Secure Inference on Homomorphically Encrypted Genotype Data with Encrypted Linear Models

Abstract:

Background: Accurate models are crucial for estimating phenotypes from high-throughput genomic data. Because genetic and phenotypic data are sensitive, secure models are essential to protect private information. Constructing a model that is both accurate and secure is therefore important for the secure inference of phenotypes.
Methods: We propose a secure inference protocol on homomorphically encrypted genotype data with encrypted linear models. First, the genotype data are scaled by feature importance obtained from XGBoost or AdaBoost, and linear models are then trained in plaintext to predict the phenotypes. Second, the model parameters and the test data are encrypted with the CKKS scheme for secure inference. Third, the phenotypes are predicted under CKKS homomorphic encryption. Finally, the client decrypts the encrypted predictions and computes 1-NRMSE or AUC for model evaluation.
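The arithmetic behind the steps above can be sketched in plaintext as follows. This is a minimal illustration, not the authors' implementation: the function names (`scale_by_importance`, `linear_predict`), the importance weights, and the model coefficients are all hypothetical, and no encryption library is used. Under CKKS, the scaled features and the model weights would both be encrypted, the same multiply-and-sum would be evaluated on ciphertexts, and decryption would return the prediction up to a small approximation error, since CKKS performs approximate arithmetic.

```python
from typing import List, Sequence


def scale_by_importance(genotypes: Sequence[float],
                        importance: Sequence[float]) -> List[float]:
    """Multiply each variant's genotype value by its feature-importance weight."""
    if len(genotypes) != len(importance):
        raise ValueError("genotype and importance vectors must have equal length")
    return [g * w for g, w in zip(genotypes, importance)]


def linear_predict(features: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Plaintext linear model: inner product of features and weights, plus bias.

    An encrypted version would evaluate the same expression on ciphertexts
    (element-wise multiply, then sum); only additions and multiplications
    are needed, which is what makes linear models convenient here.
    """
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have equal length")
    return sum(x * w for x, w in zip(features, weights)) + bias


# Toy example with 4 variants (hypothetical numbers, not from the paper).
genotype = [0, 1, 2, 1]              # allele counts per variant
importance = [0.5, 0.0, 1.0, 0.25]   # assumed output of a boosted-tree model
weights = [2.0, -1.0, 0.5, 4.0]      # assumed trained linear coefficients
bias = 0.1

scaled = scale_by_importance(genotype, importance)
prediction = linear_predict(scaled, weights, bias)
print(scaled, prediction)
```

A variant with zero importance contributes nothing to the prediction after scaling, so the scaling step also acts as a soft form of feature selection before the linear model is trained.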
Results: Five phenotypes of 3000 samples with 20390 variants were used to validate the performance of the secure inference protocol. On test data, the protocol achieves 1-NRMSE values of 0.9548, 0.9639, and 0.9673 for the three continuous phenotypes and AUC values of 0.9943 and 0.99290 for the two categorical phenotypes. The protocol is also robust across 100 rounds of random sampling. On an encrypted test set of 198 samples, it achieves an average accuracy of 0.9725, and the overall inference takes only 4.32 s. These results placed the protocol first in Track 2 of the iDASH-2022 challenge.
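The two metrics reported above can be computed from decrypted predictions as sketched below. The abstract does not define NRMSE, so normalising the RMSE by the range of the true values is an assumption (other conventions divide by the mean or the standard deviation), and the function names are illustrative. AUC is computed in its pairwise (Mann-Whitney) form, counting ties as one half.

```python
import math
from typing import Sequence


def one_minus_nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return 1 - RMSE / (max(y_true) - min(y_true)).

    Range normalisation is an assumption; the source does not state
    which NRMSE convention was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    spread = max(y_true) - min(y_true)
    if spread == 0:
        raise ValueError("y_true has zero range; NRMSE is undefined")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return 1.0 - math.sqrt(mse) / spread


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Compare every positive/negative pair; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values (hypothetical, not from the paper).
print(one_minus_nrmse([0.0, 2.0], [1.0, 1.0]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a test set of a few hundred predictions; a rank-based implementation would be preferable for much larger sets.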
Conclusions: We propose an accurate and secure protocol for predicting phenotypes from genotypes; it takes only seconds to obtain hundreds of predictions for all phenotypes.
