word embeddings

  • Article type: Journal Article
    We introduce a novel dataset of affective, semantic, and descriptive norms for all facial emojis at the point of data collection. We gathered and examined subjective ratings of emojis from 138 German speakers along five essential dimensions: valence, arousal, familiarity, clarity, and visual complexity. Additionally, we provide absolute frequency counts of emoji use, drawn from an extensive Twitter corpus, as well as a much smaller WhatsApp database. Our results replicate, for emojis, the quadratic relationship between arousal and valence that is well established for words. We also report associations among the variables: for example, the subjective familiarity of an emoji is strongly correlated with its usage frequency, and positively associated with its emotional valence and clarity of meaning. We establish the meanings associated with face emojis by asking participants for up to three descriptions of each emoji. Using these linguistic data, we computed vector embeddings for each emoji, enabling an exploration of their distribution within the semantic space. Our description-based emoji vector embeddings not only capture typical meaning components of emojis, such as their valence, but also surpass simple definitions and direct emoji2vec models in reflecting the semantic relationship between emojis and words. Our dataset stands out due to its robust reliability and validity. This new semantic norm for face emojis informs the future design of highly controlled experiments focused on the cognitive processing of emojis, their lexical representation, and their linguistic properties.
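The description-based embedding step described above can be sketched as averaging the word vectors of the descriptions participants gave for an emoji. A minimal illustration with made-up 3-dimensional vectors (the paper's actual vectors and vocabulary are not reproduced here; real work would use vectors trained on a large corpus):

```python
import math

# Hypothetical word vectors for illustration only
word_vecs = {
    "happy":   [0.9, 0.1, 0.0],
    "joyful":  [0.8, 0.2, 0.1],
    "smiling": [0.7, 0.3, 0.0],
    "sad":     [-0.8, 0.1, 0.1],
    "crying":  [-0.9, 0.2, 0.0],
}

def embed_descriptions(descriptions):
    """Average the vectors of all in-vocabulary description words."""
    vecs = [word_vecs[w] for w in descriptions if w in word_vecs]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# An emoji's embedding is the mean of its description-word vectors
smile_vec = embed_descriptions(["happy", "joyful", "smiling"])
```

The resulting vector can then be compared against word vectors to explore the emoji's position in the semantic space, e.g. confirming it sits closer to "happy" than to "sad".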

  • Article type: Journal Article
    With technological innovations, enterprises in the real world are managing every iota of data, as it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, it may result in duplicate records. As data is given paramount importance, eliminating duplicate entities is also significant for data integration, performance, and resource optimization. To realize reliable systems for record deduplication, deep learning has of late offered exciting provisions with a learning-based approach. Deep ER is one of the deep learning-based methods used recently for eliminating duplicates in structured data. Using it as a reference model, in this paper we propose a framework known as Enhanced Deep Learning-based Record Deduplication (EDL-RD) to further improve performance. Towards this end, we exploited a variant of Long Short-Term Memory (LSTM) along with various attribute compositions, similarity metrics, and numerical and null-value resolution. We propose an algorithm known as Efficient Learning-based Record Deduplication (ELbRD), which extends the reference model with the aforementioned enhancements. An empirical study reveals that the proposed framework with extensions outperforms existing methods.

  • Article type: Journal Article
    Smooth interaction with a disaster-affected community can create and strengthen its social capital, leading to greater effectiveness in the provision of successful post-disaster recovery aid. To understand the relationship between the types of interaction, the strength of social capital generated, and the provision of successful post-disaster recovery aid, intricate ethnographic qualitative research is required, but it is likely to remain illustrative because it is based, at least to some degree, on the researcher's intuition. This paper thus offers an innovative research method employing a quantitative artificial intelligence (AI)-based language model, which allows researchers to re-examine data, thereby validating the findings of the qualitative research, and to glean additional insights that might otherwise have been missed. This paper argues that well-connected personnel and religiously-based communal activities help to enhance social capital by bonding within a community and linking to outside agencies and that mixed methods, based on the AI-based language model, effectively strengthen text-based qualitative research.

  • Article type: Journal Article
    As words can have multiple meanings that depend on sentence context, genes can have various functions that depend on the surrounding biological system. This pleiotropic nature of gene function is limited by ontologies, which annotate gene functions without considering biological contexts. We contend that the gene function problem in genetics may be informed by recent technological leaps in natural language processing, in which representations of word semantics can be automatically learned from diverse language contexts. In contrast to efforts to model semantics as "is-a" relationships in the 1990s, modern distributional semantics represents words as vectors in a learned semantic space and fuels current advances in transformer-based models such as large language models and generative pre-trained transformers. A similar shift in thinking of gene functions as distributions over cellular contexts may enable a similar breakthrough in data-driven learning from large biological datasets to inform gene function.

  • Article type: Journal Article
    This study investigates the intersection of race, gender, and criminality in the language surrounding mental health and illness. Applying computational methods of word embeddings to full text data from major American newspapers between 2000 and 2023, I show that the landscape of mental health is broadly racialized as black, challenging the notion of mental illness as a predominantly white phenomenon. Cultural ideas about mental illness are gendered such that women are medicalized and men are criminalized, yet certain terms blur the boundary between illness and criminality. I highlight how stereotypes embedded in mental health language perpetuate stigma around men's mental health and justify social control with notable implications for black men. I conclude with recommendations for the mental health movement by advocating for more inclusive discussions around men's mental health and revised person-centric language.

  • Article type: Journal Article
    Today, many social groups face negative stereotypes. Is such negativity a stable feature of society and, if so, what mechanisms maintain stability both within and across group targets? Answering these theoretically and practically important questions requires data on dozens of group stereotypes examined simultaneously over historical and societal scales, which is only possible through recent advances in Natural Language Processing. Across two studies, we use word embeddings from millions of English-language books over 100 years (1900-2000) and extract stereotypes for 58 stigmatized groups. Study 1 examines aggregate, societal-level trends in stereotype negativity by averaging across these groups. Results reveal striking persistence in aggregate negativity (no meaningful slope), suggesting that society maintains a stable level of negative stereotypes. Study 2 introduces and tests a new framework identifying potential mechanisms upholding stereotype negativity over time. We find evidence of two key sources of this aggregate persistence: within-group "reproducibility" (e.g., stereotype negativity can be maintained by using different traits with the same underlying meaning) and across-group "replacement" (e.g., negativity from one group is transferred to other related groups). These findings provide novel historical evidence of mechanisms upholding stigmatization in society and raise new questions regarding the possibility of future stigma change.
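The "no meaningful slope" finding of Study 1 rests on fitting a trend line to aggregate negativity over time. A toy least-squares sketch with hypothetical scores (the study's actual values are not reproduced here):

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical aggregate negativity per decade: essentially flat
decades = [1900, 1910, 1920, 1930]
negativity = [0.50, 0.51, 0.49, 0.50]

trend = slope(decades, negativity)  # near zero -> "no meaningful slope"
```

A slope close to zero across the full 1900-2000 range is what the study reports as persistence of aggregate negativity.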

  • Article type: Journal Article
    BACKGROUND: Traditional literature-based discovery connects knowledge pairs extracted from separate publications via a common midpoint to derive previously unseen knowledge pairs. To avoid the over-generation often associated with this approach, we explore an alternative method based on word evolution. Word evolution examines the changing contexts of a word to identify changes in its meaning or associations. We investigate the possibility of using changing word contexts to detect drugs suitable for repurposing.
    RESULTS: Word embeddings, which represent a word's context, are constructed from chronologically ordered publications in MEDLINE at bi-monthly intervals, yielding a time series of word embeddings for each word. Focusing on clinical drugs only, any drugs repurposed in the final time segment of the time series are annotated as positive examples. The decision regarding a drug's repurposing is based either on the Unified Medical Language System (UMLS) or on semantic triples extracted from MEDLINE using SemRep.
    CONCLUSIONS: The annotated data allow deep learning classification, with 5-fold cross-validation, to be performed and multiple architectures to be explored. Performance of 65% using UMLS labels and 81% using SemRep labels is attained, indicating the technique's suitability for detecting candidate drugs for repurposing. The investigation also shows that different architectures are linked to the quantities of training data available and, therefore, that a different model should be trained for every annotation approach.
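The word-evolution idea can be sketched as tracking how far a drug's embedding moves between consecutive time slices; a large cosine distance flags a shifting context. A toy illustration with hypothetical 2-d vectors (the actual MEDLINE-trained embeddings are far higher-dimensional, and the drug names here are made up):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical bi-monthly embedding time series for two drugs
series = {
    "drug_a": [[1.0, 0.0], [0.98, 0.2], [0.4, 0.9]],   # context shifts late
    "drug_b": [[0.0, 1.0], [0.1, 0.99], [0.05, 1.0]],  # context stays stable
}

def max_drift(vectors):
    """Largest cosine distance between consecutive time slices."""
    return max(1 - cosine_sim(u, v) for u, v in zip(vectors, vectors[1:]))

# Drugs whose context shifted markedly become repurposing candidates
candidates = [d for d in series if max_drift(series[d]) > 0.2]
```

In the paper these drift signals feed a deep learning classifier rather than a fixed threshold; the threshold here is only for illustration.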

  • Article type: Journal Article
    In principle, the fundamental concepts person, woman, and man should apply equally to people of different genders and races/ethnicities. In reality, these concepts might prioritize certain groups over others. Based on interdisciplinary theories of androcentrism, we hypothesized that (a) person is more associated with men than women (person = man) and (b) woman is more associated with women than man is with men (i.e., women are more gendered: gender = woman). We applied natural language processing tools (specifically, word embeddings) to the linguistic output of millions of individuals (specifically, the Common Crawl corpus). We found the hypothesized person = man / gender = woman bias. This bias was stronger about Hispanic and White (vs. Asian) women and men. We also uncovered parallel biases favoring White individuals in the concepts person, woman, and man. Western society prioritizes men and White individuals as people and "others" women as people with gender, with implications for equity across policy- and decision-making contexts.
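The person = man association can be illustrated as a difference of cosine similarities in an embedding space. A toy sketch with hypothetical 2-d vectors (real analyses use, e.g., GloVe vectors trained on the Common Crawl corpus):

```python
import math

# Hypothetical vectors chosen only to illustrate the measurement
vecs = {
    "person": [0.6, 0.4],
    "man":    [0.7, 0.3],
    "woman":  [0.3, 0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Positive score: "person" sits closer to "man" than to "woman"
person_man_bias = cosine(vecs["person"], vecs["man"]) - cosine(vecs["person"], vecs["woman"])
```

The published studies aggregate such relative-similarity scores over many word stimuli per concept rather than single word pairs.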

  • Article type: Journal Article
    Social group-based identities intersect. The meaning of "woman" is modulated by adding social class, as in "rich woman" or "poor woman." How does such intersectionality operate at scale in everyday language? Which intersections dominate (are most frequent)? What qualities (positivity, competence, warmth) are ascribed to each intersection? In this study, we make it possible to address such questions by developing a stepwise procedure, Flexible Intersectional Stereotype Extraction (FISE), applied to word embeddings (GloVe; BERT) trained on billions of words of English Internet text, revealing insights into intersectional stereotypes. First, applying FISE to occupation stereotypes across intersections of gender, race, and class showed alignment with ground-truth data on occupation demographics, providing initial validation. Second, applying FISE to trait adjectives showed strong androcentrism (Men) and ethnocentrism (White) dominating everyday English language (e.g., White + Men are associated with 59% of traits; Black + Women with 5%). Associated traits also revealed intersectional differences: advantaged intersectional groups, especially intersections involving Rich, had more common, positive, warm, competent, and dominant trait associates. Together, the empirical insights from FISE illustrate its utility for transparently and efficiently quantifying intersectional stereotypes in existing large text corpora, with the potential to expand intersectionality research across unprecedented times and places. This project further sets up the infrastructure necessary to pursue new research on the emergent properties of intersectional identities.

  • Article type: Journal Article
    Anticancer peptides (ACPs) are a group of peptides that exhibit antineoplastic properties. The utilization of ACPs in cancer prevention can present a viable substitute for conventional cancer therapeutics, as they possess a higher degree of selectivity and safety. Recent scientific advancements have generated interest in peptide-based therapies, which offer the advantage of efficiently treating target cells without negatively impacting normal cells. However, as the number of peptide sequences continues to increase rapidly, developing a reliable and precise prediction model becomes a challenging task. In this work, our motivation is to advance an efficient model for categorizing anticancer peptides by combining word embedding and deep learning models. First, the Word2Vec, GloVe, FastText, and One-Hot-Encoding approaches are evaluated as embedding techniques for representing peptide sequences. Then, the outputs of the embedding models are fed into the deep learning approaches CNN, LSTM, and BiLSTM. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on two widely used datasets in the literature, ACPs250 and Independent. Experimental results show that the proposed model enhances classification accuracy compared to state-of-the-art studies. The proposed combination, FastText+BiLSTM, achieves 92.50% accuracy on the ACPs250 dataset and 96.15% on the Independent dataset, thereby setting a new state of the art.
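Of the embedding baselines listed, one-hot encoding is the simplest to sketch: each residue in a peptide sequence becomes an indicator vector over the 20 standard amino acids. A minimal illustration (the paper's full pipeline, which feeds such encodings into a CNN/LSTM/BiLSTM, is not reproduced here):

```python
# The 20 standard amino-acid one-letter codes
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a peptide string as a list of 20-dim indicator vectors."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1 if j == idx[aa] else 0 for j in range(len(AMINO_ACIDS))]
            for aa in seq]

encoded = one_hot("ACDK")  # 4 residues -> 4 rows of length 20
```

Learned embeddings such as Word2Vec or FastText replace these sparse rows with dense vectors trained on sequence context, which is what the reported FastText+BiLSTM combination uses.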
