Keywords: N400; event-related brain potentials; human language processing; information theory; language comprehension; natural language processing; neural language models; psycholinguistics; surprisal

Source: DOI:10.1162/opmi_a_00150 (PDF via PubMed)

Abstract:
Accounts of human language comprehension propose different mathematical relationships between the contextual probability of a word and how difficult it is to process, including linear, logarithmic, and super-logarithmic ones. However, the empirical evidence favoring any of these over the others is mixed, appearing to vary depending on the index of processing difficulty used and the approach taken to calculate contextual probability. To help disentangle these results, we focus on the mathematical relationship between corpus-derived contextual probability and the N400, a neural index of processing difficulty. Specifically, we use 37 contemporary transformer language models to calculate the contextual probability of stimuli from 6 experimental studies of the N400, and test whether N400 amplitude is best predicted by a linear, logarithmic, super-logarithmic, or sub-logarithmic transformation of the probabilities calculated using these language models, as well as combinations of these transformed metrics. We replicate the finding that on some datasets, a combination of linearly and logarithmically-transformed probability can predict N400 amplitude better than either metric alone. In addition, we find that overall, the best single predictor of N400 amplitude is sub-logarithmically-transformed probability, which for almost all language models and datasets explains all the variance in N400 amplitude otherwise explained by the linear and logarithmic transformations. This is a novel finding that is not predicted by any current theoretical accounts, and thus one that we argue is likely to play an important role in increasing our understanding of how the statistical regularities of language impact language comprehension.
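Below is a minimal sketch, not the authors' code, of how the pipeline described in the abstract could look in practice: compute a word's contextual probability under a causal transformer language model, then derive the linear, logarithmic, sub-logarithmic, and super-logarithmic predictors that are compared against N400 amplitude. The model choice (gpt2 stands in for the 37 transformers used in the study), the use of the first subword's probability, the function names, and the exponent k parameterizing the sub-/super-logarithmic forms are all illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch, not the authors' code. Assumptions: gpt2 as the
# language model, first-subword probability as the word probability, and
# power transforms of surprisal as the sub-/super-logarithmic predictors.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def contextual_probability(context: str, target: str) -> float:
    """P(first subword of `target` | context) under the language model."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    # Leading space so GPT-2 tokenizes the target as a separate word.
    target_id = tokenizer(" " + target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(context_ids).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[target_id].item()


def candidate_predictors(p: float, k: float = 0.5) -> dict:
    """Transformations of probability compared as predictors of N400 amplitude.

    The exponent k is a hypothetical parameterization: k < 1 makes the
    transform grow more slowly than surprisal (sub-logarithmic in 1/p),
    while 1/k > 1 makes it grow faster (super-logarithmic).
    """
    surprisal = -math.log(p)           # logarithmic transform: -log P(w | context)
    return {
        "linear": p,                   # raw probability
        "logarithmic": surprisal,
        "sub_logarithmic": surprisal ** k,
        "super_logarithmic": surprisal ** (1.0 / k),
    }


if __name__ == "__main__":
    p = contextual_probability("The children went outside to", "play")
    print(f"P(play | context) = {p:.4f}")
    print(candidate_predictors(p))
```

In the framing of the abstract, predictors of this kind would then enter a regression (for example, a mixed-effects model over stimuli from the six N400 datasets) with N400 amplitude as the dependent variable, allowing the transformations and their combinations to be compared in terms of explained variance.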