BACKGROUND: The amount of data in health care is rising rapidly, so that multiple datasets are generated for any given individual. Data integration involves mapping the variables of different datasets onto one another to form a combined dataset, which can then be used for different types of analyses. However, as the number of variables grows, manually mapping a dataset can become inefficient. An alternative is to use machine learning text classification to assign the variables to a schema.
OBJECTIVE: Our aim was to develop and evaluate machine learning methods for integrating data from datasets across health information-seeking behavior (HISB) databases.
METHODS: Four online databases relevant to the research field were selected for integration. Two experiments were designed for dataset mapping: intra-database mapping, using a single data source, and inter-database mapping, to map datasets between the four databases. We compared logistic regression (LR), random forest classifier (RFC), and neural network (NN) models by F1 score for the two methods of integration. A third experiment was an ablation study that used all the available data to create a model for classifying HISB variables in a dataset.
RESULTS: In intra-database mapping, the mean F1 score of the LR classifier (0.787) was higher than that of the RFC (0.767) and the fully connected NN (0.735). In inter-database mapping, LR scored best (0.245); however, this depended on which database was used as the training source. Using all the databases, the three models were able to correctly classify 90-91% of the variables. Removing one dataset improved the scores and resulted in a model able to correctly classify 95-96% of the HISB variables.
CONCLUSIONS: As part of data integration, a neural network can be used as an approach to map the variables of a dataset. The developed models can be used to classify the HISB terms in a database.
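
The F1 scores reported above combine precision and recall for each schema class. As an illustration only, the following is a minimal sketch of a macro-averaged F1 computed over variable-to-schema predictions. The abstract does not state which averaging was used, so macro averaging is an assumption here, and the function name `macro_f1` and the class labels are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores.

    Assumption: macro averaging; the source does not specify the averaging.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly mapped variable
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical schema classes for five dataset variables.
y_true = ["demographics", "demographics", "source", "source", "trust"]
y_pred = ["demographics", "source", "source", "source", "trust"]
print(round(macro_f1(y_true, y_pred), 3))
```

With macro averaging, every schema class contributes equally to the score regardless of how many variables it contains, so a rare class that is mapped poorly lowers the score as much as a common one.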