Visual tracking

  • Article type: Journal Article
    Query decoders have been shown to achieve good performance in object detection; however, they suffer from insufficient object tracking performance. Sequence-to-sequence learning has recently been explored in this context, with the idea of describing a target as a sequence of discrete tokens. In this study, we experimentally determine that, with an appropriate representation, a parallel approach that predicts a target coordinate sequence with a query decoder can achieve good performance and speed. We propose a concise query-based tracking framework, named QPSTrack, for predicting a target coordinate sequence in a parallel manner. A set of queries is designed so that each query is responsible for a different coordinate of the tracked target; all the queries jointly represent a single target, rather than following the traditional one-to-one matching pattern between queries and targets. Moreover, we adopt an adaptive decoding scheme comprising a one-layer adaptive decoder and learnable adaptive inputs for that decoder. This decoding scheme helps the queries better decode the template-guided search features. Furthermore, we explore the plain ViT-Base, ViT-Large, and lightweight hierarchical LeViT architectures as the encoder backbone, providing a family of three variants in total. All the trackers obtain a good trade-off between speed and performance; for instance, our tracker QPSTrack-B256 with the ViT-Base encoder achieves a 69.1% AUC on the LaSOT benchmark at 104.8 FPS.
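
    A minimal sketch of the parallel coordinate-prediction idea follows: a small set of learnable queries, one per box coordinate, is decoded in a single pass by a one-layer transformer decoder over the template-guided search features. The layer sizes, bin count, and the omission of the adaptive-input mechanism are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelCoordDecoder(nn.Module):
    """Hypothetical QPSTrack-style head: 4 queries jointly describe one target."""
    def __init__(self, dim=256, num_coords=4, num_bins=1000):
        super().__init__()
        # One learnable query per target coordinate (x, y, w, h).
        self.coord_queries = nn.Parameter(torch.randn(num_coords, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)  # one-layer decoder
        self.head = nn.Linear(dim, num_bins)  # discrete-bin coordinate representation

    def forward(self, search_feats):
        # search_feats: (B, N, dim) template-guided search features from the encoder.
        q = self.coord_queries.unsqueeze(0).expand(search_feats.size(0), -1, -1)
        decoded = self.decoder(q, search_feats)  # all coordinates decoded in parallel
        return self.head(decoded)                # (B, 4, num_bins) logits

feats = torch.randn(2, 256, 256)
coords = ParallelCoordDecoder()(feats).argmax(-1)  # (2, 4): binned x, y, w, h
```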

  • Article type: Journal Article
    Visual object tracking is an important technology in camera-based sensor networks and has wide practicality in auto-drive systems. A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data, and it has been widely applied in the field of visual tracking. Unfortunately, the security of the transformer model is unclear, which exposes transformer-based applications to security threats. In this work, the security of the transformer model was investigated in an important component of autonomous driving, i.e., visual tracking. Such deep-learning-based visual tracking is vulnerable to adversarial attacks, so adversarial attacks were implemented as the security threat under investigation. First, adversarial examples were generated on top of video sequences to degrade the tracking performance, taking frame-by-frame temporal motion into consideration when generating perturbations over the depicted tracking results. Then, the influence of the perturbations on performance was sequentially investigated and analyzed. Finally, numerous experiments on the OTB100, VOT2018, and GOT-10k data sets demonstrated that the generated adversarial examples were effective at degrading the performance of transformer-based visual tracking. White-box attacks showed the highest effectiveness, with attack success rates exceeding 90% against transformer-based trackers.
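
    As a concrete illustration of the attack setting, the sketch below perturbs a single frame FGSM-style so that the tracker's response at the true target location is suppressed. The stand-in model, mask format, and epsilon are hypothetical, and the paper's attack additionally exploits frame-to-frame temporal motion, which is omitted here.

```python
import torch

def fgsm_frame_attack(track_model, frame, gt_mask, epsilon=8 / 255):
    # frame: (1, 3, H, W) in [0, 1]; gt_mask marks the true target location on
    # the response map produced by the (hypothetical) track_model callable.
    frame = frame.clone().requires_grad_(True)
    response = track_model(frame)
    # Loss = tracker confidence at the true location; stepping against its
    # gradient lowers the response peak at the target, degrading tracking.
    loss = (response * gt_mask).sum()
    loss.backward()
    adv = frame - epsilon * frame.grad.sign()
    return adv.clamp(0, 1).detach()

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in "tracker" response head
x = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 30:34, 30:34] = 1.0
adv_frame = fgsm_frame_attack(model, x, mask)
```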

  • Article type: Journal Article
    Most trackers formulate visual tracking as common classification and regression (i.e., bounding box regression) tasks. Correlation features computed through depth-wise convolution or channel-wise multiplication operations are fed into both the classification and regression branches for inference. However, this matching computation with a linear correlation method tends to lose semantic features and reach only a local optimum. Moreover, these trackers use an unreliable ranking based on the classification score and the intersection over union (IoU) loss for regression training, thus degrading tracking performance. In this paper, we introduce a deformable transformer model that effectively computes the correlation features of the training and search sets. A new loss called the quality-aware focal loss (QAFL) is used to train the classification network; it efficiently alleviates the inconsistency between the classification and localization quality predictions. We use a new regression loss called α-GIoU to train the regression network, which effectively improves localization accuracy. To further improve the tracker's robustness, the candidate object location is predicted by combining online-learning scores and classification scores within a transformer-assisted framework. Extensive experiments on six test datasets demonstrate the effectiveness of our method. In particular, the proposed method attains a success score of 71.7% on the OTB-2015 dataset and an AUC score of 67.3% on the NFS30 dataset.
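
    The α-GIoU regression loss mentioned above follows the α-IoU family, in which a power parameter is applied to the IoU terms of the GIoU loss; a hedged sketch under that reading is shown below (α = 3 is a common choice in the α-IoU literature, not necessarily the paper's setting).

```python
import torch

def alpha_giou_loss(pred, target, alpha=3.0, eps=1e-7):
    # Boxes as (x1, y1, x2, y2), shape (N, 4).
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box C penalizes non-overlapping predictions.
    cx1 = torch.min(pred[:, 0], target[:, 0]); cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2]); cy2 = torch.max(pred[:, 3], target[:, 3])
    c_area = (cx2 - cx1) * (cy2 - cy1) + eps
    return 1 - iou**alpha + ((c_area - union) / c_area)**alpha

p = torch.tensor([[0., 0., 2., 2.]])
t = torch.tensor([[1., 1., 3., 3.]])
print(alpha_giou_loss(p, t))  # ≈ 1.008 for these poorly overlapping boxes
```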

  • Article type: Journal Article
    OBJECTIVE: Surgical robotics tends to develop cognitive control architectures that provide a certain degree of autonomy, improving patient safety and surgery outcomes while decreasing the surgeons' cognitive load dedicated to low-level decisions. Cognition needs workspace perception, which is an essential step towards automatic decision-making and task-planning capabilities. Robust and accurate detection and tracking in minimally invasive surgery suffers from limited visibility, occlusions, anatomy deformations and camera movements.
    METHODS: This paper develops a robust methodology to detect and track anatomical structures in real time, for use in automatic control of robotic systems and augmented reality. The work focuses on experimental validation in a highly challenging surgery: fetoscopic repair of Open Spina Bifida. The proposed method is based on two sequential steps: first, selection of relevant points (the contour) using a Convolutional Neural Network and, second, reconstruction of the anatomical shape by means of deformable geometric primitives.
    RESULTS: The methodology's performance was validated in different scenarios. Synthetic scenario tests, designed for extreme validation conditions, demonstrate the safety margin offered by the methodology with respect to the nominal conditions during surgery. Real scenario experiments demonstrated the validity of the method in terms of accuracy, robustness and computational efficiency.
    CONCLUSIONS: This paper presents robust anatomical structure detection in the presence of abrupt camera movements, severe occlusions and deformations. Even though the paper focuses on a case study, Open Spina Bifida, the methodology is applicable to all anatomies whose contours can be approximated by geometric primitives. The methodology is designed to provide effective inputs to cognitive robotic control and augmented reality systems that require accurate tracking of sensitive anatomies.
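
    A minimal sketch of the two-step pipeline follows: a CNN (not included) proposes contour pixels, and a deformable geometric primitive, here an ellipse fitted with OpenCV, reconstructs the anatomical shape. The function name and threshold are placeholders for illustration.

```python
import cv2
import numpy as np

def fit_anatomy(contour_prob_map, thresh=0.5):
    # contour_prob_map: (H, W) map from a CNN marking likely contour pixels.
    ys, xs = np.nonzero(contour_prob_map > thresh)
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    if len(pts) < 5:  # cv2.fitEllipse needs at least 5 points
        return None
    # (center, axes, angle): a low-dimensional shape that stays trackable
    # under occlusion, deformation and camera motion.
    return cv2.fitEllipse(pts)

mask = np.zeros((64, 64), np.uint8)
cv2.ellipse(mask, ((32, 32), (30, 20), 15), 255, 1)  # synthetic "contour"
print(fit_anatomy(mask, thresh=0))
```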

  • Article type: Journal Article
    Pupil size is a significant biosignal for human behavior monitoring and can reveal much underlying information. This study explored the effects of task load, task familiarity, and gaze position on pupil response during the learning of a visual tracking task. We hypothesized that pupil size would increase with task load up to a certain level before decreasing, would decrease with task familiarity, and would increase more when focusing on areas preceding the target than on other areas. Fifteen participants were recruited for an arrow-tracking learning task with incremental task load. Pupil size data were collected using a Tobii Pro Nano eye tracker. A 2 × 3 × 5 three-way factorial repeated-measures ANOVA was conducted using R (version 4.2.1) to evaluate the main and interactive effects of the key variables on adjusted pupil size. The association between individuals' cognitive load, assessed by the NASA-TLX, and pupil size was further analyzed using a linear mixed-effects model. We found that task repetition resulted in a reduction in pupil size; however, this effect diminished as the task load increased. The main effect of task load approached statistical significance, but different trends were observed in trial 1 and trial 2. No significant difference in pupil size was detected among the three gaze positions. Overall, the relationship between pupil size and cognitive load followed an inverted-U curve. Our study showed how pupil size changes as a function of task load, task familiarity, and gaze scanning. This finding provides sensory evidence that could improve educational outcomes.
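
    For readers wanting to reproduce the second analysis, the sketch below fits a linear mixed-effects model of pupil size on task load with a random intercept per participant, using Python's statsmodels rather than R; the column names and the synthetic inverted-U data are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(15), 10),  # 15 participants, 10 trials each
    "tlx": rng.uniform(0, 100, 150),          # stand-in NASA-TLX scores
})
# Synthetic inverted-U response standing in for adjusted pupil size.
df["pupil"] = 3 + 0.04 * df["tlx"] - 0.0004 * df["tlx"] ** 2 + rng.normal(0, 0.1, 150)

# Random intercept per subject; a negative quadratic term matches the inverted U.
model = smf.mixedlm("pupil ~ tlx + I(tlx ** 2)", df, groups=df["subject"])
print(model.fit().summary())
```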

  • Article type: Journal Article
    Manual motor performance declines with age, but the extent to which age influences the acquisition of new skills remains a topic of debate. Here, we examined whether older healthy adults show smaller training-dependent performance improvements during a single session of a bimanual pinch task than younger adults. We also explored whether physical and cognitive factors, such as grip strength or motor-cognitive ability, are associated with performance improvements. Healthy younger (n = 16) and older (n = 20) adults performed three training blocks separated by short breaks. Participants were tasked with producing visually instructed changes in pinch force using their right and left thumb and index fingers. Task complexity was varied by shifting between bimanual mirror-symmetric and inverse-asymmetric changes in pinch force. Older adults generally displayed higher visuomotor force tracking errors during the more complex inverse-asymmetric task than younger adults. Both groups showed a comparable net decrease in visuomotor force tracking error over the entire session, but their improvement trajectories differed: younger adults improved their visuomotor tracking error only in the first block, while older adults exhibited a more gradual improvement over the three training blocks. Furthermore, grip strength and performance on a motor-cognitive test battery scaled positively with individual performance improvements during the first block in both age groups. Together, the results show subtle age-dependent differences in the rate of bimanual visuomotor skill acquisition, while overall short-term learning ability is maintained.
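
    The abstract does not state the exact error definition; as a hedged illustration, visuomotor force tracking error can be computed as the root-mean-square deviation between the instructed and produced force profiles:

```python
import numpy as np

def tracking_error(instructed_force, produced_force):
    # RMSE between the visually instructed profile and the produced pinch force.
    instructed = np.asarray(instructed_force, dtype=float)
    produced = np.asarray(produced_force, dtype=float)
    return np.sqrt(np.mean((produced - instructed) ** 2))

print(tracking_error([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))  # ~0.22 (force units)
```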

  • Article type: Journal Article
    Understanding animal movement and behaviour can aid spatial planning and inform conservation management. However, it is difficult to directly observe behaviours in remote and hostile terrain such as the marine environment. Different underlying states can be identified from telemetry data using hidden Markov models (HMMs); the inferred states are subsequently associated with different behaviours using ecological knowledge of the species. However, the inferred behaviours are typically not validated, because of the difficulty of obtaining 'ground truth' behavioural information. We investigate the accuracy of inferred behaviours by considering a unique data set provided by the Joint Nature Conservation Committee. The data consist of simultaneous proxy movement tracks of the boat (defined as visual tracks, as the birds are followed by eye) and seabird behaviour recorded by observers on the boat. We demonstrate that visual tracking data are suitable for our study. HMM accuracy, ranging from 71% to 87% during chick-rearing and from 54% to 70% during incubation, was generally insensitive to model choice, even when AIC values varied substantially across different models. Finally, we show that for foraging, the state of primary interest for conservation purposes, the missed foraging bouts lasted only a few seconds. We conclude that HMMs fitted to tracking data have the potential to accurately identify important conservation-relevant behaviours, demonstrated by a comparison in which visual tracking data provide a 'gold standard' of manually classified behaviours to validate against. Confidence in using HMMs for behavioural inference should increase as a result of these findings, but future work is needed to assess the generalisability of the results, and we recommend that, wherever feasible, validation data be collected alongside GPS tracking data to verify model performance. This work has important implications for animal conservation, where the size and location of protected areas are often informed by behaviours identified using HMMs fitted to movement data.
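
    A hedged sketch of the modelling step follows: a two-state Gaussian HMM is fitted to per-fix step lengths and turning angles with hmmlearn, and the decoded states are later labelled (e.g., foraging vs. transit) from ecological knowledge. The feature choice, state count, and synthetic data are assumptions; the paper's exact model family may differ.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Placeholder telemetry features; real inputs come from the visual/GPS tracks.
step_lengths = rng.gamma(2.0, 1.0, size=500)
turning_angles = rng.uniform(-np.pi, np.pi, size=500)
obs = np.column_stack([step_lengths, turning_angles])  # (T, 2)

hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
hmm.fit(obs)
states = hmm.predict(obs)  # 0/1 per fix, to be matched against observer labels
print(np.bincount(states))
```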

  • Article type: Journal Article
    Visual tracking is a crucial task in computer vision that has been applied in diverse fields. Recently, the transformer architecture has been widely applied in visual tracking and has become the mainstream framework, replacing the Siamese structure. Although transformer-based trackers have demonstrated remarkable accuracy in general circumstances, their performance in occluded scenes remains unsatisfactory, primarily because they cannot recognize incomplete target appearance information when the target is occluded. To address this issue, we propose a novel transformer tracking approach, referred to as TATT, which integrates a target-aware transformer network and a hard occlusion instance generation module. The target-aware transformer network uses an encoder-decoder structure to facilitate interaction between the template and search features, extracting target information from the template feature to enhance the unoccluded parts of the target in the search features. It can directly predict the boundary between the target region and the background to generate tracking results. The hard occlusion instance generation module employs multiple image-similarity measures to select the image patch in a video sequence that is most similar to the target, generating an occlusion instance that mimics real scenes without adding an extra network. Experiments on five benchmarks, including LaSOT, TrackingNet, GOT-10k, OTB100, and UAV123, demonstrate that our tracker achieves promising performance while running at approximately 41 fps on a GPU. Specifically, our tracker achieves the highest AUC scores of 65.5% and 61.2% in the partial and full occlusion evaluations on LaSOT, respectively.
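
    The occlusion-generation idea can be sketched as follows: find the patch in the frame that looks most like the target (normalized cross-correlation here stands in for the paper's multiple similarity measures) and paste it over the target, producing a hard occlusion instance without any extra network. The function name and box handling are illustrative.

```python
import cv2
import numpy as np

def occlude_with_similar_patch(frame_gray, template, target_xy):
    # target_xy: top-left (x, y) of the target box; template: target appearance.
    h, w = template.shape
    res = cv2.matchTemplate(frame_gray, template, cv2.TM_CCOEFF_NORMED)
    tx, ty = target_xy
    # Suppress locations overlapping the true target so a distractor is chosen.
    res[max(0, ty - h):ty + h, max(0, tx - w):tx + w] = -1.0
    _, _, _, (px, py) = cv2.minMaxLoc(res)
    out = frame_gray.copy()
    out[ty:ty + h, tx:tx + w] = frame_gray[py:py + h, px:px + w]  # paste look-alike
    return out

frame = (np.random.rand(120, 160) * 255).astype(np.uint8)
template = frame[40:56, 60:76].copy()
occluded = occlude_with_similar_patch(frame, template, (60, 40))
```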

  • Article type: Journal Article
    Transformer-based tracking methods have shown great potential in visual tracking and have achieved significant tracking performance. A traditional transformer-based feature fusion network divides the whole feature map into multiple image patches as its input and then processes them directly in parallel, which occupies considerable computing resources and reduces the computational efficiency of multi-head attention. In this paper, we design a novel feature fusion network with optimized multi-head attention in a transformer-based encoder-decoder architecture. The designed feature fusion network preprocesses the input features and changes the computation of multi-head attention by using an efficient multi-head self-attention module together with an efficient multi-head spatial-reduction attention module. These two modules reduce the influence of irrelevant background information, enhance the representation ability of the template and search-region features, and greatly reduce the computational complexity. Based on the designed feature fusion network, we propose a novel transformer tracking method, named EMAT. The proposed EMAT is evaluated on seven challenging tracking benchmarks to demonstrate its superiority: LaSOT, GOT-10k, TrackingNet, UAV123, VOT2018, NfS and VOT-RGBT2019. The proposed tracker achieves good tracking performance, obtaining a precision score of 89.0% on UAV123, an AUC score of 64.6% on LaSOT, and an EAO score of 34.8% on VOT-RGBT2019, outperforming most advanced trackers. EMAT runs at a real-time speed of about 35 FPS during tracking.
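
    The spatial-reduction idea behind such efficient attention modules can be sketched as below: keys and values are computed on a spatially downsampled copy of the feature map, shrinking the attention matrix by the square of the reduction ratio. Layer sizes and the reduction ratio are illustrative, not EMAT's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim=256, heads=8, sr_ratio=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Strided conv reduces the (H, W) grid feeding K and V by sr_ratio^2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)

    def forward(self, x, hw):
        h, w = hw                                     # x: (B, H*W, dim)
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)   # (B, H*W / sr^2, dim)
        out, _ = self.attn(x, kv, kv)                 # queries stay full-resolution
        return out

x = torch.randn(2, 16 * 16, 256)
y = SpatialReductionAttention()(x, (16, 16))
print(y.shape)  # torch.Size([2, 256, 256])
```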

  • Article type: Journal Article
    Siamese tracking has witnessed tremendous progress as a tracking paradigm. However, its default box estimation pipeline still faces a crucial inconsistency issue: the bounding box selected by the classification score does not always overlap best with the ground truth, which harms performance. To this end, we explore a novel, simple tracking paradigm based on intersection over union (IoU) value prediction. To bypass the inconsistency issue, we propose a concise target state predictor termed IoUformer, which, instead of the default box estimation pipeline, directly predicts IoU values related to the tracking performance metrics. In detail, it extends the long-range dependency modeling ability of the transformer to jointly grasp target-aware interactions between the target template and the search region, as well as search sub-region interactions, thus neatly unifying global semantic interaction and target state prediction. Thanks to this joint strength, IoUformer can predict reliable IoU values that are near-linear with the ground truth, which paves a safe way for our new IoU-based Siamese tracking paradigm. Since it is non-trivial to explore this paradigm with satisfactory efficacy and portability, we offer the respective network components and two alternative localization ways. Experimental results show that our IoUformer-based tracker achieves promising results with less training data. Regarding its applicability, it also serves as a refinement module that consistently boosts existing advanced trackers.
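
    The supervision behind an IoU-prediction head can be made concrete with a short sketch: the head regresses, for each candidate box, its IoU with the ground truth, and at test time the candidate with the highest predicted IoU is selected, sidestepping the classification-score inconsistency described above. The names and the MSE objective are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def iou_head_loss(pred_iou, cand_boxes, gt_box, eps=1e-7):
    # pred_iou: (N,) IoU values predicted by a head such as IoUformer.
    x1 = torch.max(cand_boxes[:, 0], gt_box[0]); y1 = torch.max(cand_boxes[:, 1], gt_box[1])
    x2 = torch.min(cand_boxes[:, 2], gt_box[2]); y2 = torch.min(cand_boxes[:, 3], gt_box[3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_c = (cand_boxes[:, 2] - cand_boxes[:, 0]) * (cand_boxes[:, 3] - cand_boxes[:, 1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    true_iou = inter / (area_c + area_g - inter + eps)  # regression target
    return F.mse_loss(pred_iou, true_iou)

cand = torch.tensor([[0., 0., 2., 2.], [1., 1., 3., 3.]])
gt = torch.tensor([1., 1., 3., 3.])
print(iou_head_loss(torch.tensor([0.2, 0.9]), cand, gt))
```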
