Keywords: Breast cancer; Hormonal therapy; Natural language processing; Patient portal messages; Word embedding models; Word2vec

MeSH: Humans; Breast Neoplasms; Female; Natural Language Processing; Patient Portals; Semantics; Electronic Health Records

Source: DOI: 10.1038/s41598-024-66319-z   PDF (PubMed)

Abstract:
Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages sent by patients diagnosed with breast cancer from December 2004 to November 2017 through the Vanderbilt University Medical Center electronic health record system. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words.
In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference between the evaluations by AMT workers and by medical students was negligible across all comparisons of task choices between the two groups of reviewers (p = 0.774, paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
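The prior-knowledge step described above retrieves, for each vocabulary word, its most similar terms from a pre-trained embedding model. The idea can be sketched as a top-k cosine-similarity lookup over an embedding table; the words and vectors below are toy examples, not the paper's data or the authors' implementation:

```python
import numpy as np

# Hypothetical toy pre-trained embedding table. In the paper, neighbors come
# from two real pre-trained models; the entries here are illustrative only.
pretrained = {
    "tamoxifen":  np.array([0.90, 0.10, 0.00]),
    "raloxifene": np.array([0.85, 0.15, 0.05]),
    "aspirin":    np.array([0.20, 0.80, 0.10]),
    "fatigue":    np.array([0.10, 0.20, 0.90]),
    "tiredness":  np.array([0.12, 0.25, 0.88]),
}

def top_k_neighbors(word, embeddings, k=5):
    """Return the k words most cosine-similar to `word`.

    This is the kind of neighbor set that could serve as the
    prior-knowledge input for a small-corpus training run.
    """
    target = embeddings[word]
    scores = []
    for other, vec in embeddings.items():
        if other == word:
            continue  # a word is not its own neighbor
        cos = float(np.dot(target, vec) /
                    (np.linalg.norm(target) * np.linalg.norm(vec)))
        scores.append((cos, other))
    scores.sort(reverse=True)  # highest cosine similarity first
    return [w for _, w in scores[:k]]

print(top_k_neighbors("tamoxifen", pretrained, k=2))
```

In the study this retrieval was applied separately to medical words (problems, treatments, tests) and non-medical words, drawing on a different pre-trained model for each category.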
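The comparison between AMT workers and medical students uses a paired t-test over per-task choices. A minimal sketch of the paired t-statistic, on made-up per-task scores (not the study's data), shows the computation:

```python
import math

def paired_t_statistic(x, y):
    """Paired t-statistic: t = mean(d) / (sd(d) / sqrt(n)), where d = x - y."""
    n = len(x)
    d = [a - b for a, b in zip(x, y)]
    mean_d = sum(d) / n
    # sample variance of the paired differences (n - 1 denominator)
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-task agreement scores for the two reviewer groups
amt_workers = [0.90, 0.80, 0.95, 0.85]
med_students = [0.88, 0.82, 0.93, 0.86]
print(paired_t_statistic(amt_workers, med_students))
```

A t-statistic near zero (and a correspondingly large p-value, such as the reported p = 0.774) indicates no meaningful difference between the two reviewer groups' judgments.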