Word embedding models

  • Article type: Journal Article
    Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages sent by patients diagnosed with breast cancer in the Vanderbilt University Medical Center electronic health record system from December 2004 to November 2017. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words. In over 90% of the tasks, both reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of the tasks' choices between the two groups of reviewers (p = 0.774 under a paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
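    The core operation in each evaluation task above — retrieving a word's most similar neighbours in an embedding space — can be sketched with plain cosine similarity. This is a toy vocabulary with hand-made vectors, purely illustrative, not the authors' pipeline:

    ```python
    import numpy as np

    def top_k_similar(word, vocab, vectors, k=5):
        """Return the k words whose vectors have the highest cosine similarity to `word`."""
        idx = vocab.index(word)
        v = vectors[idx]
        # cosine similarity of the query vector against every row
        sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
        sims[idx] = -np.inf  # exclude the query word itself
        order = np.argsort(sims)[::-1][:k]
        return [vocab[i] for i in order]

    # Toy vocabulary with hand-made 3-d "embeddings" (illustrative only)
    vocab = ["tumor", "lesion", "mass", "appointment", "billing", "nausea"]
    vectors = np.array([
        [0.9, 0.1, 0.0],   # tumor
        [0.8, 0.2, 0.1],   # lesion
        [0.7, 0.3, 0.0],   # mass
        [0.0, 0.9, 0.4],   # appointment
        [0.1, 0.8, 0.5],   # billing
        [0.4, 0.5, 0.6],   # nausea
    ])

    print(top_k_similar("tumor", vocab, vectors, k=3))  # → ['lesion', 'mass', 'nausea']
    ```

    In each task, a reviewer would see the query word alongside the top-five lists produced by the two competing models and judge which list is more relevant.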

  • Article type: Journal Article
    BACKGROUND: Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small.
    OBJECTIVE: We introduce a novel adaptation of the word2vec model, PK-word2vec, for small-scale messages.
    METHODS: PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec to patient portal messages sent by patients diagnosed with breast cancer in the Vanderbilt University Medical Center electronic health record system from December 2004 to November 2017. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks.
    RESULTS: The dataset was composed of 1,389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7,981 non-medical and 1,116 medical words. In over 90% of the tasks, both reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of the tasks' choices between the two groups of reviewers (p = 0.774 under a paired t-test).
    CONCLUSIONS: PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.

  • Article type: Journal Article
    Word valence is one of the principal dimensions in the organization of word meaning. Co-occurrence-based similarities calculated by predictive natural language processing models are relatively poor at representing affective content, but very powerful in their own way. Here, we determined how these two canonical but distinct ways of representing word meaning relate to each other in the human brain both functionally and neuroanatomically. We re-analysed an fMRI study of word valence. A co-occurrence-based model was used and the correlation with the similarity of brain activity patterns was compared to that of affective similarities. The correlation between affective and co-occurrence-based similarities was low (r = 0.065), confirming that affect was captured poorly by co-occurrence modelling. In a whole-brain representational similarity analysis, word embedding similarities correlated significantly with the similarity between activity patterns in a region confined to the left superior temporal sulcus and, to a lesser degree, to the right. Affective word similarities correlated with the similarity in activity patterns in this same region, confirming previous findings. The affective similarity effect extended more widely beyond the superior temporal cortex than the effect of co-occurrence-based similarities did. The effect of co-occurrence-based similarities remained unaltered after partialling out the effect of affective similarities (and vice versa). To conclude, different aspects of word meaning, derived from affective judgements or from word co-occurrences, are represented in superior temporal language cortex in a neuroanatomically overlapping but functionally independent manner.
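    The key analytic move in this study — testing whether the co-occurrence effect survives after partialling out affective similarity — amounts to a partial correlation, which can be sketched via the residual method. The vectors and effect sizes below are simulated for illustration; they are not the study's data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def partial_corr(x, y, z):
        """Correlation between x and y after partialling out z (residual method)."""
        def residuals(a, b):
            # residuals of a after an intercept + slope regression on b
            B = np.column_stack([np.ones_like(b), b])
            coef, *_ = np.linalg.lstsq(B, a, rcond=None)
            return a - B @ coef
        return np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

    # Simulated similarity vectors over n word pairs (illustrative only; the real
    # analysis correlates model similarity matrices with fMRI pattern similarities)
    n = 500
    affective = rng.normal(size=n)                     # affective similarity
    cooccur = 0.065 * affective + rng.normal(size=n)   # weakly related, echoing the reported r
    brain = 0.5 * affective + 0.5 * cooccur + rng.normal(size=n)

    print(np.corrcoef(affective, cooccur)[0, 1])   # small by construction
    print(partial_corr(brain, cooccur, affective)) # co-occurrence effect after removing affect
    ```

    If the second value stays clearly nonzero, the co-occurrence-based similarity explains variance in the brain patterns beyond what affective similarity accounts for, which is the pattern the authors report.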

  • Article type: Journal Article
    This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files gathering the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. Raw data files were processed by running an R-language script with the aim of computing all evaluation metrics reported in (Lastra-Díaz et al., 2019), such as Pearson and Spearman correlation, harmonic score and statistical-significance p-values, as well as to automatically generate all the data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform that allows new word-similarity benchmarks to be run by setting up new experiments that include other, previously unconsidered methods or benchmarks.
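    The evaluation metrics named above — Pearson and Spearman correlation and the harmonic score — can be computed per method and benchmark along these lines. The ratings are toy values, and the sketch assumes the harmonic score is the harmonic mean of the two correlations, as commonly defined in this survey literature:

    ```python
    from scipy.stats import pearsonr, spearmanr

    def benchmark_scores(method_sims, human_ratings):
        """Pearson r, Spearman rho, and their harmonic mean for one method on one benchmark."""
        r = pearsonr(method_sims, human_ratings)[0]
        rho = spearmanr(method_sims, human_ratings)[0]
        harmonic = 2 * r * rho / (r + rho)  # harmonic score
        return r, rho, harmonic

    # Toy benchmark: gold human similarity ratings vs. one method's raw similarity values
    human = [9.2, 8.5, 7.1, 3.0, 1.5]
    method = [0.91, 0.80, 0.75, 0.40, 0.05]

    r, rho, h = benchmark_scores(method, human)
    ```

    Spearman is rank-based, so any monotone rescaling of a method's raw scores leaves rho unchanged, which is why both correlations (and their harmonic mean) are reported side by side.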

  • Article type: Journal Article
    Natural language processing (NLP) of health-related data is still an expertise-demanding and resource-expensive process. We created a novel, open-source rapid clinical text mining system called NimbleMiner. NimbleMiner combines several machine learning techniques (word embedding models and positive-only-labels learning) to facilitate the process in which a human rapidly performs text mining of clinical narratives while being aided by the machine learning components.
    This manuscript describes the general system architecture and user interface and presents the results of a case study aimed at classifying fall-related information (including fall history, fall prevention interventions, and fall risk) in homecare visit notes.
    We extracted a corpus of homecare visit notes (n = 1,149,586) for 89,459 patients from a large US-based homecare agency. We used a gold-standard testing dataset of 750 notes annotated by two human reviewers to compare NimbleMiner's ability to classify documents regarding whether they contain fall-related information with that of a previously developed rule-based NLP system.
    NimbleMiner outperformed the rule-based system in almost all domains. The overall F-score was 85.8%, compared to 81% for the rule-based system, with the best performance for identifying general fall history (F = 89% vs. F = 85.1% rule-based), followed by fall risk (F = 87% vs. F = 78.7% rule-based), fall prevention interventions (F = 88.1% vs. F = 78.2% rule-based), and falls within 2 days of the note date (F = 83.1% vs. F = 80.6% rule-based). The rule-based system achieved slightly better performance for falls within 2 weeks of the note date (F = 81.9% vs. F = 84% rule-based).
    NimbleMiner outperformed other systems aimed at fall information classification, including our previously developed rule-based approach. These promising results indicate that clinical text mining can be implemented without the need for large labeled datasets necessary for other types of machine learning. This is critical for domains with little NLP developments, like nursing or allied health professions.
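    For reference, the F-scores compared above are the harmonic mean of precision and recall computed from confusion counts; a minimal sketch with illustrative counts, not figures from the study:

    ```python
    def f1_score(tp, fp, fn):
        """F1: harmonic mean of precision and recall, from raw confusion counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # e.g. 85 true positives, 10 false positives, 18 false negatives (made-up counts)
    print(round(f1_score(85, 10, 18), 3))  # → 0.859
    ```

    Because F1 weights precision and recall equally, a system can lose slightly on one domain (as NimbleMiner did for falls within 2 weeks) while still winning overall.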