Keywords: artificial intelligence; distributed representations; gene function; large language models; lexical semantics; machine learning; transformers; word embeddings

MeSH: Humans; Semantics; Natural Language Processing; Genes / genetics; Gene Ontology; Computational Biology / methods; Animals

Source: DOI:10.1016/j.cels.2024.04.008

Abstract:
As words can have multiple meanings that depend on sentence context, genes can have various functions that depend on the surrounding biological system. This pleiotropic nature of gene function is limited by ontologies, which annotate gene functions without considering biological contexts. We contend that the gene function problem in genetics may be informed by recent technological leaps in natural language processing, in which representations of word semantics can be automatically learned from diverse language contexts. In contrast to efforts to model semantics as "is-a" relationships in the 1990s, modern distributional semantics represents words as vectors in a learned semantic space and fuels current advances in transformer-based models such as large language models and generative pre-trained transformers. A similar shift in thinking of gene functions as distributions over cellular contexts may enable a similar breakthrough in data-driven learning from large biological datasets to inform gene function.