Object classification has been proposed as a principal objective of the primate ventral visual stream and has been used as an optimization target for deep neural network models (DNNs) of the visual system. However, visual brain areas represent many different types of information, and optimizing for classification of object identity alone does not constrain how other information may be encoded in visual representations. Information about different scene parameters may be discarded altogether ('invariance'), represented in non-interfering subspaces of population activity ('factorization'), or encoded in an entangled fashion. In this work, we provide evidence that factorization is a normative principle of biological visual representations. In the monkey ventral visual hierarchy, we found that factorization of object pose and background information from object identity increased in higher-level regions and strongly contributed to improving object identity decoding performance. We then conducted a large-scale analysis of factorization of individual scene parameters - lighting, background, camera viewpoint, and object pose - in a diverse library of DNN models of the visual system. Models that best matched neural, fMRI, and behavioral data from both monkeys and humans across 12 datasets tended to be those that factorized scene parameters most strongly. Notably, invariance to these parameters was not as consistently associated with matches to neural and behavioral data, suggesting that maintaining non-class information in factorized activity subspaces is often preferred to dropping it altogether. Thus, we propose that factorization of visual scene information is a widely used strategy in brains and DNN models thereof.
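The distinction between invariance and factorization can be made concrete with a toy metric. The sketch below (an illustrative assumption, not the authors' exact implementation) scores how much of the population variance driven by one scene parameter (e.g. object pose) lies outside the principal subspace of variance driven by another variable (e.g. object identity): a score near 1 means the two are encoded in non-interfering subspaces (factorized), while a score near 0 means they are entangled.

```python
import numpy as np

def factorization_score(resp_param, resp_other, n_components=10):
    """Illustrative factorization metric (names and details are assumptions).

    resp_param: (n_samples, n_neurons) population responses as one scene
        parameter (e.g. object pose) varies, everything else held fixed.
    resp_other: (n_samples, n_neurons) responses as the other variable
        (e.g. object identity) varies.

    Returns a value in [0, 1]: 1 if the parameter's variance lies entirely
    outside the other variable's principal subspace (fully factorized),
    0 if it lies entirely inside it (fully entangled).
    """
    # Principal axes of the variance driven by the other variable
    other_centered = resp_other - resp_other.mean(axis=0)
    _, _, vt = np.linalg.svd(other_centered, full_matrices=False)
    basis = vt[:n_components].T  # (n_neurons, k) orthonormal basis

    # Fraction of the parameter-driven variance falling inside that subspace
    param_centered = resp_param - resp_param.mean(axis=0)
    total_var = np.sum(param_centered ** 2)
    var_inside = np.sum((param_centered @ basis) ** 2)
    return 1.0 - var_inside / total_var

# Synthetic check: identity drives one group of neurons, pose another
rng = np.random.default_rng(0)
ident = np.zeros((200, 50)); ident[:, :10] = rng.normal(size=(200, 10))
pose = np.zeros((200, 50)); pose[:, -10:] = rng.normal(size=(200, 10))
print(factorization_score(pose, ident))   # disjoint subspaces -> near 1
print(factorization_score(ident, ident))  # same subspace -> near 0
```

By contrast, an invariance metric would simply ask how small the parameter-driven variance is overall; the paper's point is that high factorization, not low total variance, is what best tracks brain-likeness.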
When looking at a picture, we can quickly identify a recognizable object, such as an apple, and apply a single word label to it. Although extensive neuroscience research has focused on how human and monkey brains achieve this recognition, our understanding of how the brain and brain-like computer models interpret other complex aspects of a visual scene – such as object position and environmental context – remains incomplete. In particular, it was not clear to what extent object recognition comes at the expense of other important scene details. On the one hand, various aspects of the scene might be processed simultaneously. On the other hand, general object recognition may interfere with the processing of such details. To investigate this, Lindsey and Issa analyzed 12 monkey and human brain datasets, as well as numerous computer models, to explore how different aspects of a scene are encoded in neurons and how these aspects are represented by computational models. The analysis revealed that preventing the effective separation and retention of information about object pose and environmental context worsened object identification in monkey cortex neurons. In addition, the computer models that were the most brain-like could independently preserve other scene details without interfering with object identification. The findings suggest that human and monkey high-level ventral visual processing systems are capable of representing the environment in a more complex way than previously appreciated. In the future, studying more brain activity data could help to identify how rich the encoded information is and how it might support other functions, such as spatial navigation. This knowledge could help to build computational models that process information in the same way, potentially improving their understanding of real-world scenes.