Keywords: CatBoost; Polycystic Ovary Syndrome (PCOS); SHAP values; clustering; machine learning; prediction; principal component analysis; self-diagnosis; subgroup study

Source: DOI:10.2196/29967

Abstract:
BACKGROUND: Artificial intelligence and digital health care have substantially advanced to improve and enhance medical diagnosis and treatment during the prolonged period of the COVID-19 global pandemic. In this study, we discuss the development of prediction models for the self-diagnosis of polycystic ovary syndrome (PCOS) using machine learning techniques.
OBJECTIVE: We aim to develop self-diagnostic prediction models for PCOS for potential patients and clinical providers. For potential patients, the prediction is based only on noninvasive measures such as anthropometric measures, symptoms, age, and other lifestyle factors, so that the proposed prediction tool can be used conveniently without any laboratory or ultrasound test results. For clinical providers, who can access patients' medical test results, prediction models using all predictor variables can be adopted to help diagnose patients with PCOS. We compare both prediction models using various error metrics. Throughout this paper, we call the former the patient model and the latter the provider model.
METHODS: In this retrospective study, a publicly available data set of health information on 541 women, collected from 10 different hospitals in Kerala, India, and including PCOS status, was acquired and used for analysis. We adopted the CatBoost method for classification, k-fold cross-validation for estimating the performance of models, and SHAP (Shapley Additive Explanations) values to explain the importance of each variable. In our subgroup study, we used k-means clustering and principal component analysis to split the data set into 2 distinct BMI subgroups and compared the prediction results as well as the feature importance between the 2 subgroups.
RESULTS: We achieved 81% to 82.5% prediction accuracy of PCOS status without any invasive measures in the patient models and achieved 87.5% to 90.1% prediction accuracy using both noninvasive and invasive predictor variables in the provider models. Among noninvasive measures, variables including acanthosis nigricans, acne, hirsutism, irregular menstrual cycle, length of menstrual cycle, weight gain, fast food consumption, and age were more important in the models. In medical test results, the numbers of follicles in the right and left ovaries and anti-Müllerian hormone were ranked highly in feature importance. We also reported more detailed results in a subgroup study.
CONCLUSIONS: The proposed prediction models are ultimately expected to serve as a convenient digital platform through which users can obtain a pre- or self-diagnosis and counseling on their risk of PCOS, with or without medical test results. The platform would let women check their risk conveniently at home, without delay, before they seek further medical care. Clinical providers can also use the proposed prediction tool to help diagnose PCOS in women.
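The METHODS paragraph mentions k-fold cross-validation for estimating model performance. The following is a minimal sketch of that evaluation loop in plain Python, not the authors' code. The helper names (`k_fold_indices`, `cross_validated_accuracy`, `fit_majority_class`) are hypothetical, the toy labels are made up for illustration, and a trivial majority-class predictor stands in for CatBoost so that the example runs without external libraries.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives fold sizes that differ by at most one.
    return [indices[start::k] for start in range(k)]


def cross_validated_accuracy(rows, labels, fit, k=5, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k held-out folds.

    `fit(train_rows, train_labels)` must return a function mapping one row
    to a predicted label. In the study this role is played by CatBoost;
    here it is an assumption left to the caller.
    """
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(rows[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores), scores


def fit_majority_class(train_rows, train_labels):
    """Placeholder model: ignores the features and predicts the most
    frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda row: majority


if __name__ == "__main__":
    # Made-up toy data: 10 samples, one dummy feature, 3 positive labels.
    toy_rows = [[float(i)] for i in range(10)]
    toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    mean_acc, fold_accs = cross_validated_accuracy(
        toy_rows, toy_labels, fit_majority_class, k=5
    )
    print(mean_acc, fold_accs)
```

Passing the model in as a `fit` callable keeps the splitting and scoring logic independent of the classifier, so a gradient-boosting model could be substituted for the placeholder without changing the loop.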