Visual tracking

    Query decoders have been shown to achieve good performance in object detection. However, their object tracking performance remains insufficient. Sequence-to-sequence learning has recently been explored in this context, with the idea of describing a target as a sequence of discrete tokens. In this study, we determine experimentally that, with an appropriate representation, a parallel approach to predicting a target coordinate sequence with a query decoder can achieve good performance and speed. We propose a concise query-based tracking framework, named QPSTrack, for predicting a target coordinate sequence in parallel. A set of queries is designed to be responsible for different coordinates of the tracked target. All the queries jointly represent a target, rather than following the traditional one-to-one matching pattern between query and target. Moreover, we adopt an adaptive decoding scheme that includes a one-layer adaptive decoder and learnable adaptive inputs for the decoder. This decoding scheme helps the queries decode the template-guided search features better. Furthermore, we explore the plain ViT-Base, ViT-Large, and lightweight hierarchical LeViT architectures as the encoder backbone, providing a family of three variants in total. All the trackers obtain a good trade-off between speed and performance; for instance, our tracker QPSTrack-B256 with the ViT-Base encoder achieves a 69.1% AUC on the LaSOT benchmark at 104.8 FPS.
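
    As a rough illustration of describing a target as a sequence of discrete tokens and reading all coordinates out in parallel, the following minimal sketch quantizes a normalized box into bin indices and decodes one bin per coordinate from four independent score vectors. The bin count, the [x1, y1, x2, y2] layout, the arg-max decoding, and all function names are assumptions made for this example; the abstract does not specify QPSTrack's actual representation.

```python
def box_to_tokens(box, num_bins=1000):
    """Quantize a normalized [x1, y1, x2, y2] box (values in [0, 1]) into bin indices."""
    return [min(num_bins - 1, int(c * num_bins)) for c in box]


def tokens_to_box(tokens, num_bins=1000):
    """Map bin indices back to the centre of each bin."""
    return [(t + 0.5) / num_bins for t in tokens]


def decode_parallel(per_query_scores):
    """Pick the highest-scoring bin for every coordinate independently.

    per_query_scores holds one score vector per coordinate query (assumed
    layout). No coordinate depends on a previously decoded one, which is what
    makes the read-out parallel rather than autoregressive.
    """
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in per_query_scores]


if __name__ == "__main__":
    tokens = box_to_tokens([0.25, 0.5, 0.75, 1.0], num_bins=100)
    print(tokens)
    print(tokens_to_box(tokens, num_bins=100))
```

    In this sketch the quantization error per coordinate is at most half a bin width, so a finer bin count trades a larger output vocabulary for tighter boxes.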






    Visual object tracking is an important technology in camera-based sensor networks and has a wide range of applications in autonomous driving systems. A transformer is a deep learning model that adopts the self-attention mechanism, differentially weighting the significance of each part of the input data. It has been widely applied in the field of visual tracking. Unfortunately, the security of the transformer model is unclear, which exposes such transformer-based applications to security threats. In this work, the security of the transformer model was investigated through an important component of autonomous driving, i.e., visual tracking. Such deep-learning-based visual tracking is vulnerable to adversarial attacks; thus, adversarial attacks were implemented as the security threats for the investigation. First, adversarial examples were generated on top of video sequences to degrade the tracking performance, and the frame-by-frame temporal motion was taken into consideration when generating perturbations over the depicted tracking results. Then, the influence of the perturbations on performance was sequentially investigated and analyzed. Finally, numerous experiments on the OTB100, VOT2018, and GOT-10k data sets demonstrated that the generated adversarial examples effectively degraded the performance of transformer-based visual tracking. White-box attacks showed the highest effectiveness, with attack success rates exceeding 90% against transformer-based trackers.






    Most trackers formulate visual tracking as common classification and regression (i.e., bounding box regression) tasks. Correlation features that are computed through depth-wise convolution or channel-wise multiplication operations are input into both the classification and regression branches for inference. However, this matching computation with the linear correlation method tends to lose semantic features and obtain only a local optimum. Moreover, these trackers use an unreliable ranking based on the classification score and the intersection over union (IoU) loss for the regression training, thus degrading the tracking performance. In this paper, we introduce a deformable transformer model, which effectively computes the correlation features of the training and search sets. A new loss called the quality-aware focal loss (QAFL) is used to train the classification network; it efficiently alleviates the inconsistency between the classification and localization quality predictions. We use a new regression loss called α-GIoU to train the regression network, and it effectively improves localization accuracy. To further improve the tracker's robustness, the candidate object location is predicted by using a combination of online learning scores with a transformer-assisted framework and classification scores. Extensive experiments on six test datasets demonstrate the effectiveness of our method. In particular, the proposed method attains a success score of 71.7% on the OTB-2015 dataset and an AUC score of 67.3% on the NFS30 dataset.
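
    To make the regression loss more concrete, the sketch below computes a power-generalized GIoU loss for two axis-aligned boxes. The abstract does not state the formula, so the form used here, 1 - IoU^alpha + ((C - U) / C)^alpha with C the area of the smallest enclosing box and U the union area, is an assumption based on the commonly described alpha-IoU family; the default alpha = 3 and the function name are likewise illustrative. With alpha = 1 the expression reduces to the standard GIoU loss.

```python
def alpha_giou_loss(box_a, box_b, alpha=3.0):
    """Assumed alpha-GIoU form for boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    penalty = (enclose - union) / enclose

    return 1.0 - iou ** alpha + penalty ** alpha


if __name__ == "__main__":
    print(alpha_giou_loss((0, 0, 2, 2), (1, 1, 3, 3), alpha=1.0))
    print(alpha_giou_loss((0, 0, 2, 2), (1, 1, 3, 3), alpha=3.0))
```

    Unlike a plain IoU loss, the enclosing-box penalty stays non-zero for disjoint boxes, so the loss still distinguishes near misses from far misses.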






    OBJECTIVE: Surgical robotics tends to develop cognitive control architectures that provide a certain degree of autonomy to improve patient safety and surgery outcomes, while decreasing the cognitive load that surgeons must dedicate to low-level decisions. Cognition needs workspace perception, which is an essential step towards automatic decision-making and task planning capabilities. Robust and accurate detection and tracking in minimally invasive surgery suffers from limited visibility, occlusions, anatomy deformations and camera movements.
    METHODS: This paper develops a robust methodology to detect and track anatomical structures in real time, to be used in automatic control of robotic systems and augmented reality. The work focuses on experimental validation in a highly challenging surgery: fetoscopic repair of Open Spina Bifida. The proposed method is based on two sequential steps: first, selection of relevant points (contour) using a Convolutional Neural Network and, second, reconstruction of the anatomical shape by means of deformable geometric primitives.
    RESULTS: The methodology performance was validated in different scenarios. Synthetic scenario tests, designed for extreme validation conditions, demonstrate the safety margin offered by the methodology with respect to the nominal conditions during surgery. Real scenario experiments have demonstrated the validity of the method in terms of accuracy, robustness and computational efficiency.
    CONCLUSIONS: This paper presents robust anatomical structure detection in the presence of abrupt camera movements, severe occlusions and deformations. Even though the paper focuses on a case study, Open Spina Bifida, the methodology is applicable to all anatomies whose contours can be approximated by geometric primitives. The methodology is designed to provide effective inputs to cognitive robotic control and augmented reality systems that require accurate tracking of sensitive anatomies.






    Pupil size is a significant biosignal for human behavior monitoring and can reveal much underlying information. This study explored the effects of task load, task familiarity, and gaze position on pupil response while learning a visual tracking task. We hypothesized that pupil size would increase with task load up to a certain level before decreasing, decrease with task familiarity, and increase more when focusing on areas preceding the target than on other areas. Fifteen participants were recruited for an arrow tracking learning task with incremental task load. Pupil size data were collected using a Tobii Pro Nano eye tracker. A 2 × 3 × 5 three-way factorial repeated measures ANOVA was conducted using R (version 4.2.1) to evaluate the main and interactive effects of key variables on adjusted pupil size. The association between individuals' cognitive load, assessed by NASA-TLX, and pupil size was further analyzed using a linear mixed-effect model. We found that task repetition resulted in a reduction in pupil size; however, this effect was found to diminish as the task load increased. The main effect of task load approached statistical significance, but different trends were observed in trial 1 and trial 2. No significant difference in pupil size was detected among the three gaze positions. The relationship between pupil size and cognitive load overall followed an inverted U curve. Our study showed how pupil size changes as a function of task load, task familiarity, and gaze scanning. This finding provides sensory evidence that could improve educational outcomes.






    Manual motor performance declines with age, but the extent to which age influences the acquisition of new skills remains a topic of debate. Here, we examined whether older healthy adults show smaller training-dependent performance improvements during a single session of a bimanual pinch task than younger adults. We also explored whether physical and cognitive factors, such as grip strength or motor-cognitive ability, are associated with performance improvements. Healthy younger (n = 16) and older (n = 20) adults performed three training blocks separated by short breaks. Participants were tasked with producing visually instructed changes in pinch force using their right and left thumb and index fingers. Task complexity was varied by shifting between bimanual mirror-symmetric and inverse-asymmetric changes in pinch force. Older adults generally displayed higher visuomotor force tracking errors during the more complex inverse-asymmetric task compared to younger adults. Both groups showed a comparable net decrease in visuomotor force tracking error over the entire session, but their improvement trajectories differed. Young adults reduced their visuomotor tracking error only in the first block, while older adults exhibited a more gradual improvement over the three training blocks. Furthermore, grip strength and performance on a motor-cognitive test battery scaled positively with individual performance improvements during the first block in both age groups. Together, the results show subtle age-dependent differences in the rate of bimanual visuomotor skill acquisition, while overall short-term learning ability is maintained.






    Understanding animal movement and behaviour can aid spatial planning and inform conservation management. However, it is difficult to directly observe behaviours in remote and hostile terrain such as the marine environment. Different underlying states can be identified from telemetry data using hidden Markov models (HMMs). The inferred states are subsequently associated with different behaviours, using ecological knowledge of the species. However, the inferred behaviours are not typically validated due to the difficulty of obtaining 'ground truth' behavioural information. We investigate the accuracy of inferred behaviours by considering a unique data set provided by the Joint Nature Conservation Committee. The data consist of simultaneous proxy movement tracks of the boat (defined as visual tracks as birds are followed by eye) and seabird behaviour obtained by observers on the boat. We demonstrate that visual tracking data is suitable for our study. Accuracy of HMMs ranging from 71% to 87% during chick-rearing and 54% to 70% during incubation was generally insensitive to model choice, even when AIC values varied substantially across different models. Finally, we show that for foraging, a state of primary interest for conservation purposes, identified missed foraging bouts lasted for only a few seconds. We conclude that HMMs fitted to tracking data have the potential to accurately identify important conservation-relevant behaviours, demonstrated by a comparison in which visual tracking data provide a 'gold standard' of manually classified behaviours to validate against. Confidence in using HMMs for behavioural inference should increase as a result of these findings, but future work is needed to assess the generalisability of the results, and we recommend that, wherever feasible, validation data be collected alongside GPS tracking data to validate model performance.
This work has important implications for animal conservation, where the size and location of protected areas are often informed by behaviours identified using HMMs fitted to movement data.
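
    The validation idea described above, inferring hidden behavioural states and scoring them against observed behaviour, can be sketched with a toy two-state HMM decoded by the Viterbi algorithm. Everything concrete here is an assumption for illustration: the state names, the discretized "short"/"long" step observations, and all probabilities are invented, and the study's HMMs were fitted to real track data rather than specified by hand.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back_pointers = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores.append(current)
        back_pointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def accuracy(predicted, observed):
    """Fraction of time steps where the inferred state matches the observed one."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical model parameters (not taken from the study).
STATES = ["forage", "transit"]
START = {"forage": 0.5, "transit": 0.5}
TRANS = {"forage": {"forage": 0.8, "transit": 0.2},
         "transit": {"forage": 0.2, "transit": 0.8}}
EMIT = {"forage": {"short": 0.9, "long": 0.1},
        "transit": {"short": 0.2, "long": 0.8}}

if __name__ == "__main__":
    steps = ["long", "long", "short", "short", "long"]
    inferred = viterbi(steps, STATES, START, TRANS, EMIT)
    observed = ["transit", "transit", "forage", "transit", "transit"]
    print(inferred, accuracy(inferred, observed))
```

    The `accuracy` helper corresponds to the kind of per-time-step agreement that percentage accuracies summarize, although the study's exact scoring procedure is not given in the abstract.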






    Visual tracking is a crucial task in computer vision that has been applied in diverse fields. Recently, the transformer architecture has been widely applied in visual tracking and has become a mainstream framework, replacing the Siamese structure. Although transformer-based trackers have demonstrated remarkable accuracy in general circumstances, their performance in occluded scenes remains unsatisfactory. This is primarily due to their inability to recognize incomplete target appearance information when the target is occluded. To address this issue, we propose a novel transformer tracking approach referred to as TATT, which integrates a target-aware transformer network and a hard occlusion instance generation module. The target-aware transformer network utilizes an encoder-decoder structure to facilitate interaction between template and search features, extracting target information in the template feature to enhance the unoccluded parts of the target in the search features. It can directly predict the boundary between the target region and the background to generate tracking results. The hard occlusion instance generation module employs multiple image similarity calculation methods to select the image patch in video sequences that is most similar to the target and to generate an occlusion instance mimicking real scenes without adding an extra network. Experiments on five benchmarks, including LaSOT, TrackingNet, GOT-10k, OTB100, and UAV123, demonstrate that our tracker achieves promising performance while running at approximately 41 fps on GPU. Specifically, our tracker achieves the highest AUC scores of 65.5% and 61.2% in partial and full occlusion evaluations on LaSOT, respectively.






    Tracking methods based on the Transformer have shown great potential in visual tracking and achieved strong tracking performance. The traditional transformer-based feature fusion network divides a whole feature map into multiple image patches as its inputs and then processes them directly in parallel, which occupies a lot of computing resources and reduces the computing efficiency of multi-head attention. In this paper, we design a novel feature fusion network with optimized multi-head attention in a Transformer-based encoder-decoder architecture. The designed feature fusion network preprocesses the input features and changes the calculation of multi-head attention by using both an efficient multi-head self-attention module and an efficient multi-head spatial reduction attention module. The two modules can reduce the influence of irrelevant background information, enhance the representation ability of template features and search region features, and greatly reduce the computational complexity. We propose a novel Transformer tracking method (named EMAT) based on the designed feature fusion network. The proposed EMAT is evaluated on seven challenging tracking benchmarks to demonstrate its superiority, including LaSOT, GOT-10k, TrackingNet, UAV123, VOT2018, NfS and VOT-RGBT2019. The proposed tracker achieves good tracking performance, obtaining a precision score of 89.0% on UAV123, an AUC score of 64.6% on LaSOT, and an EAO score of 34.8% on VOT-RGBT2019, outperforming most advanced trackers. EMAT runs at a real-time speed of about 35 FPS during tracking.
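
    The claimed reduction in computational complexity can be illustrated with a back-of-the-envelope count. The sketch below assumes that spatial reduction attention downsamples the key and value feature maps by a factor `reduction` along each side before attention is computed, which is how spatial-reduction attention is commonly described; the abstract does not give EMAT's exact design, and the feature map size, channel dimension, and function names are hypothetical.

```python
def attention_macs(num_queries, num_keys, dim):
    """Multiply-accumulates for the QK^T scores plus the weighted sum over V."""
    return 2 * num_queries * num_keys * dim


def reduced_token_count(height, width, reduction):
    """Number of key/value tokens after downsampling each side by `reduction`."""
    return (height // reduction) * (width // reduction)


if __name__ == "__main__":
    height, width, dim, reduction = 32, 32, 256, 4  # illustrative values only
    queries = height * width
    full = attention_macs(queries, queries, dim)
    reduced = attention_macs(queries, reduced_token_count(height, width, reduction), dim)
    print(full, reduced, full / reduced)
```

    Under these assumptions the attention cost shrinks by a factor of `reduction` squared, because the number of queries is unchanged while the number of keys and values drops quadratically with the per-side reduction.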






    Siamese tracking has witnessed tremendous progress as a tracking paradigm. However, its default box estimation pipeline still faces a crucial inconsistency issue: the bounding box selected by its classification score does not always overlap best with the ground truth, which harms performance. To this end, we explore a novel, simple tracking paradigm based on intersection over union (IoU) value prediction. To bypass this inconsistency issue, we first propose a concise target state predictor termed IoUformer, which, instead of using the default box estimation pipeline, directly predicts the IoU values related to tracking performance metrics. In detail, it extends the long-range dependency modeling ability of the transformer to jointly capture target-aware interactions between the target template and the search region, as well as interactions among search sub-regions, thus neatly unifying global semantic interaction and target state prediction. Thanks to this joint strength, IoUformer can predict reliable IoU values that are near-linear with the ground truth, which paves a safe way for our new IoU-based Siamese tracking paradigm. Since it is non-trivial to explore this paradigm with satisfactory efficacy and portability, we offer the respective network components and two alternative localization methods. Experimental results show that our IoUformer-based tracker achieves promising results with less training data. In terms of applicability, it can also serve as a refinement module to consistently boost existing advanced trackers.
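
    The inconsistency issue can be shown with a small numeric example: the candidate with the highest classification score is not necessarily the one that overlaps the ground truth best. The candidate boxes, scores, and ground-truth box below are invented for illustration, and the true IoU stands in for the value an IoU predictor would estimate at inference time, when no ground truth is available.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def select_by_score(candidates):
    """Default pipeline: keep the box with the highest classification score."""
    return max(candidates, key=lambda c: c["score"])["box"]


def select_by_iou(candidates, iou_values):
    """IoU-based selection: keep the box with the highest (predicted) IoU."""
    best = max(range(len(candidates)), key=iou_values.__getitem__)
    return candidates[best]["box"]


# Hypothetical candidates and ground truth.
CANDIDATES = [{"box": (10, 10, 50, 50), "score": 0.92},
              {"box": (14, 12, 56, 54), "score": 0.88}]
GROUND_TRUTH = (15, 12, 55, 55)

if __name__ == "__main__":
    ious = [iou(c["box"], GROUND_TRUTH) for c in CANDIDATES]
    print(select_by_score(CANDIDATES), select_by_iou(CANDIDATES, ious), ious)
```

    Here the higher-scoring first candidate overlaps the ground truth noticeably less than the second, so ranking by a reliable IoU estimate would choose the better box.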





