0% found this document useful (0 votes)
198 views

2022 Visual Object Tracking A Survey

This document surveys visual object tracking algorithms. It categorizes trackers into generative, discriminative, and collaborative trackers. Recently, deep learning-based trackers have achieved great success due to outstanding performance. The survey provides a comprehensive overview of state-of-the-art tracking frameworks, including both deep and non-deep trackers. It analyzes tracking results on benchmark datasets and discusses challenges such as occlusion, illumination changes, deformation, and motion blur.

Uploaded by

Paul
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
198 views

2022 Visual Object Tracking A Survey

This document surveys visual object tracking algorithms. It categorizes trackers into generative, discriminative, and collaborative trackers. Recently, deep learning-based trackers have achieved great success due to outstanding performance. The survey provides a comprehensive overview of state-of-the-art tracking frameworks, including both deep and non-deep trackers. It analyzes tracking results on benchmark datasets and discusses challenges such as occlusion, illumination changes, deformation, and motion blur.

Uploaded by

Paul
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

Computer Vision and Image Understanding 222 (2022) 103508

Contents lists available at ScienceDirect

Computer Vision and Image Understanding


journal homepage: www.elsevier.com/locate/cviu

Visual object tracking: A survey


Fei Chen a ,∗, Xiaodong Wang a ,∗, Yunxiang Zhao b ,∗, Shaohe Lv c , Xin Niu a
a
College of Computer, National University of Defense Technology, Changsha, 410073, China
b
Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, 100071, China
c
Malanyun Research Institute, Changsha, 410073, China

ARTICLE INFO ABSTRACT


Communicated by Nikos Paragios Visual object tracking is an important area in computer vision, and many tracking algorithms have been
proposed with promising results. Existing object tracking approaches can be categorized into generative
MSC:
trackers, discriminative trackers, and collaborative trackers. Recently, object tracking algorithms based on deep
41A05
41A10
neural networks have emerged and obtained great attention from researchers due to their outstanding tracking
65D05 performance. To summarize the development of object tracking, a few surveys give analyses on either deep or
65D17 non-deep trackers. In this paper, we provide a comprehensive overview of state-of-the-art tracking frameworks
including both deep and non-deep trackers. We present both quantitative and qualitative tracking results of
Keywords:
various trackers on five benchmark datasets and conduct a comparative analysis of their results. We further
Object tracking
discuss challenging circumstances such as occlusion, illumination, deformation, and motion blur. Finally, we
Computer vision
Discriminative trackers list the challenges and the future work in this fast-growing field.
Deep neural networks

1. Introduction occlusion, stability of the camera, and the processing speed of the
trackers.
An object tracking algorithm tracks the object’s position in a 2D We have witnessed the rapid development of visual object track-
or 3D input from devices such as wireless sensor networks (wireless ing in the past few years, from sparse representations based trackers
signal), radar (radar echo), or cameras (video frames). Visual object (e.g., L1T, Mei and Ling, 2009), discriminative trackers (e.g., MIL,
tracking takes a 3D frame sequence as the input to track a target
Babenko et al., 2011, KCF, Henriques et al., 2015), to deep Siamese
object. Given the initialization of a specific target, visual object tracking
network based trackers (e.g., SiamFC, Bertinetto et al., 2016c and
tracks the trajectory of the target in the frame sequence. There are two
manners to optimize the trajectory of a target object during the tracking SiamRPN, Li et al., 2018c). Deep learning (DL) is an emerging tech-
process. The first is offline tracking, and the second is online tracking. nique for computer vision, and deep neural networks (DNN) have
Offline tracking (Smeulders et al., 2014) allows for global optimization shown great success in tasks such as image classification, object de-
of the tracking trajectory and is mainly for multiple objects tracking tection, image segmentation, motion recognition, and face recognition.
tasks. It scans forward and backward through all the frames of a Trackers based on deep learning have also achieved significant im-
sequence during the tracking procedure. Online tracking, on the other provement compared to traditional methods, including the trackers
hand, aims to estimate the states of the target in subsequent frames that combined hand-crafted features with deep features or end-to-end
given the state of the first frame. Different from offline tracking, online tracking frameworks based on deep neural networks.
tracking supports forward scanning only. There are survey papers on visual object tracking. In 2006, Yilmaz
Visual Object tracking can be applied to many applications, such
et al. (2006) proposed a survey that describes a general tracking process
as video surveillance, motion analysis, vehicle navigation, traffic mon-
and classifies tracking approaches based on the shape and appearance
itoring, and automatic robot navigation. For example, in video surveil-
representations of the target, including point tracking, kernel track-
lance, it is necessary to figure out the trajectory of a specific target. In
automatic robot navigation, a mobile robot needs to track and recog- ing, and silhouette tracking. The shape representations include points,
nize a moving object. The performance of the trackers is susceptible primitive geometric shapes, object silhouette and contour, articulated
to circumstances around the target and the variability of the target shape models, and skeletal models. Moreover, the appearance represen-
itself, such as the appearance representation of the target, illumina- tations are probability densities of object appearance, templates, active
tion changes, the moving speed of the target, circumstance changes, appearance models, and multi-view appearance models, providing us

∗ Corresponding authors.
E-mail addresses: chenfei14@nudt.edu.cn (F. Chen), xdwang@nudt.edu.cn (X. Wang), zhaoyx1993@163.com (Y. Zhao).

https://doi.org/10.1016/j.cviu.2022.103508
Received 10 July 2021; Received in revised form 17 April 2022; Accepted 11 July 2022
Available online 19 July 2022
1077-3142/© 2022 Elsevier Inc. All rights reserved.
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

with a systematic analysis of the whole procedure of visual tracking. Recently, advanced bounding box regression approaches and training
Smeulders et al. (2014) evaluated nineteen different trackers emerged methods further enhance the discriminative ability and accuracy of the
from 1999 to 2012 with multiple evaluation metrics and proposed tracker. The relationship between object detection and visual object
a new dataset ALOV++ that covers more challenging circumstances. tracking has been strengthened. We can see that many state-of-the-art
Zhang et al. (2013) overviewed the trackers based on sparse coding trackers in literature benefit from the object detection technologies,
and conducted an experimental comparison of the results of repre- such as Region Proposal Networks, anchor-based or anchor-free detec-
sentative trackers. Li et al. (2018b) summarized the state-of-the-art tors. In addition, new proposed large scale datasets provide researchers
trackers based on deep learning and give a comprehensive experimental powerful means to build and train deeper and complex neural networks
comparison on three tracking datasets, including OTB-100 (Wu et al., (e.g., SiamRPN++, Li et al., 2019b, SiamDW, Zhang and Peng, 2019)
2015), TC-128 (Liang et al., 2015), and VOT2015 (Kristan et al., 2015). instead of shallow layers (e.g., FCNT, Wang et al., 2015, GOTURN, Held
While many challenging scenarios in visual object tracking need to be et al., 2016, and SINT, Tao et al., 2016) for learning the target model.
studied and addressed, especially with the development of machine Compared with traditional discriminative trackers, most deep trackers
learning and deep learning, the practical application requirements, can achieve higher tracking speeds due to their end-to-end frameworks.
and new large-scale datasets. Fiaz et al. (2019) reviewed recently However, one limitation of existing deep trackers is that they must
proposed trackers based on handcrafted features and deep learning. carefully design and finetune multiple hyper-parameters when tracking
This survey discussed trackers from two main categories: Correlation on a new dataset.
Filter (CF) based trackers and Non-CF based trackers. Marvasti-Zadeh The main contributions of this survey are listed as follows:
et al. (2021) summarized trackers based on deep learning technologies. 1. We give a comprehensive analysis of trackers that cover both the
They categorized the trackers from nine aspects, including network ar- conventional trackers and the deep learning based trackers.
chitecture, network exploitation, network training, network objective, 2. We experimentally evaluate the tracking results of state-of-the-art
network output, exploitation of correlation filter advantages, aerial- trackers on five important benchmark datasets.
view tracking, long-term tracking, and online tracking. Compared to 3. We discuss the eleven challenging scenarios, hyperparameter
previous surveys, the main differences of this survey are: (1) the cate- tuning, and motion models that influence the tracking performance.
gorization method in our paper is different and finer for discriminative 4. We give suggestions for future directions in this area based on
trackers than Fiaz et al. (2019); (2) we include more recent deep the advanced technologies of computer vision and deep learning.
trackers, and provide detailed experimental results on five tracking The rest of the paper is organized as follows. We discuss the gen-
datasets such as OTB-100, VOT, LaSOT, GOT-10k, and TrackingNet; erative trackers in Section 2 and review the discriminative trackers
(3) most of the traditional tracking methods are ignored in Marvasti- in Section 3. We discuss the collaborative trackers in Section 4 and
Zadeh et al. (2021), while we considered both the traditional and more review the trackers based on deep learning in Section 5. We present
recent deep trackers, we provide the detail experimental results for the evaluation methodologies for trackers in Section 6 and summarize
more recent state-of-the-art trackers compared to Marvasti-Zadeh et al. the tracking datasets in Section 7. We present the tracking results in
(2021); (4) the characteristics of each tracker with different attributes Section 8 and summarize challenging circumstances in visual object
are summarized in Table 7. tracking in Section 9. We discuss the model update, motion model,
In this survey, we mainly focus on online tracking algorithms pro- hyperparameter tuning, and tracking speed in Section 9. We discuss
posed in the past ten years, but also consider some early representa- the future directions and open issues in Section 10, and we conclude
tive works. To systematically analyze state-of-the-art trackers, we clas- remarks in Section 11.
sify existing methods into generative trackers, discriminative trackers,
collaborative trackers, and deep learning based trackers. Generative 2. Generative trackers
trackers establish the appearance model of the target in a continuous
sequence to search the region most similar to the target object. The In this section, we give a comprehensive review of generative track-
appearance model plays a key role in the process of tracking. Many ers. Given the initial information (e.g., ground truth bounding box)
approaches have been developed for learning an appearance model, of the target object in the first image, the generative trackers mainly
such as subspace learning, sparse representations, spatio-temporal mo- search for the most similar region in the test image. Generally, the
tion energy, and boolean map representations. While discriminative generative trackers can be categorized into template matching based
trackers formulate the visual object tracking as a binary classification trackers and particle filter based trackers (detailed in Sections 2.1 and
problem, in which a classifier is trained to distinguish the target from 2.2, respectively). In particle based trackers, the target state can be
the background. The representative methods include MIL (Babenko determined by the maximum a posteriori estimation given the previous
et al., 2011), Struck (Hare et al., 2016), DLSSVM (dual linear Structured observation set of the target up to the current frame. This process is
SVM) (Ning et al., 2016), and correlation filter based trackers. The mainly related to a state transition model and an observation model
trackers equip with a correlation filter significantly accelerate the (detailed in Section 2.2). In addition, multiple appearance models can
learning and detection with convolution theorem and Fast Fourier be integrated into the particle filtering framework conveniently.
Transform (FFT) (Bolme et al., 2010). To adapt to the appearance
variation of a target, these trackers allow the classifiers or correlation 2.1. Generative trackers based on template matching
filters (CFs) used by them to be trained online. Collaborative trackers
take advantage of both generative and discriminative ways for tracking. In template matching, an object is represented by the target tem-
Benefiting from the great success of deep learning technologies in plates, and the tracking process is performed by matching the repre-
computer vision, a tremendous number of deep trackers have been sentation of the target in the current frame with previous templates
proposed, which can be classified into two categories: (1) the first via a similarity measure method. For example, a fast normalized cross-
category combines deep features with traditional tracking algorithms; correlation algorithm is applied in NCC (Briechle and Hanebeck, 2001)
(2) the second category is implemented with deep neural networks to perform template matching to reduce the computation load directly.
and trained in an end-to-end manner. Early trackers exploited to uti- In KLT (Baker and Matthews, 2004), the affine warp is utilized to find
lize deep features from a pre-trained model (e.g., imagenet-vgg-2048 the matching between the template image 𝑇 (𝑥) and input image 𝐼(𝑥).
network, Chatfield et al., 2014) on other visual tasks. Afterward, sev- In KAT (Nguyen and Smeulders, 2004), the appearance features are
eral works formulated trackers as a convolutional network for similar smoothed temporally by robust Kalman filters to update the template
object detection or to regress to the target location directly, which for handling occlusions. Then the target matching is performed by
can be trained end-to-end in an offline phase with large-scale datasets. optimizing the robust error function (Hager and Belhumeur, 1998;

2
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Baker and Matthews, 2004) between two image patches, which makes with a set of additional new images. However, the update efficiency
the error metric more robust to the outliers. significantly influences the tracking performance. In IVT (Ross et al.,
The constraint of trackers based template matching is that they 2008), an incremental learning algorithm was proposed to update mean
mainly exploit the spatial information through template but less adap- and eigenbasis accurately and efficiently in the low-dimensional sub-
tivity when the target undergoes non-rigid transformations. In LOT space. Such a learning strategy can incrementally incorporate new data
(Avidan et al., 2015), the utilized Locally Orderless Matching in the with less space and time complexity and adapt online to appearance
Earth Mover’s Distance (EMD) optimization problem is more robust to changes of the target. In Fig. 1, we give a detailed illustration of the in-
occlusion and non-rigid transformations compared to IVT (Ross et al., cremental learning and tracking based on particle filtering framework.
2008), MIL (Babenko et al., 2011), and TLD (Kalal et al., 2011). Yu et al. PMT (Zhang et al., 2014a) employed the part based appearance model
(2016) integrated multiple features into a unified similarity measure to represent the target and the part matching methods to obtain the
and the relationship among samples in the forthcoming frame is utilized occlusion information of individual parts. In TAG (Kwon et al., 2009),
to enhance this similarity measure. By extending the similarity measure the 2D affine motion was reformulated as the particle filtering problem
Best-Buddies Similarity (BBS) proposed in Dekel et al. (2015), Liu on the 2D affine group 𝐴𝑓 𝑓 (2).
et al. (2018) applied the Mutual Buddies Similarity (MBS) to find the
optimal candidate region that suits the template, which is updated via 2.2.2. Sparse representation
a memory filtering strategy. The difference between MBS and BBS is There is extensive literature on sparse representation based trackers,
that the former utilizes the similarity metric MBP between two patches, in which the target can be represented as a sparse linear combination of
which is computed among multiple reciprocal nearest neighbors. While target templates. To handle the occlusion problem, Mei and Ling (2009)
the similarity metric BBP used in BBS is the special case of BBS where firstly proposed the sparsity representation composed of a set of target
the number of nearest neighbors of a patch is 1. and trivial templates for visual object tracking. The sparsity can be ob-
tained by solving an 𝓁1 -regularized optimization problem. Meanwhile,
2.2. Generative trackers based on particle filter the dictionary templates update dynamically according to appearance
changes to preserve robustness. Afterward, many related works have
The particle filter or Sequential Monte Carlo (SMC) (Isard and Blake, been done to improve the performance of sparse representation models.
1998) model, which approximates the posterior probability of state For example, the sparsity-inducing 𝓁𝑝,𝑞 mixed norms (𝑝 ≥ 1, 𝑞 ≥ 1)
variables with a finite set of particles sampled from the state space, pro- based Multi-task learning in MTT (Zhang et al., 2012a), multi-view
vides a convenient framework for visual object tracking (Arulampalam (e.g., color, shape and texture) based multi-task learning that exploits
et al., 2002; Isard and Blake, 1998) due to their generality, flexibility, both the relationship between particles and views in MTMVT (Hong
and simple implementation. In addition, the particle filters used for et al., 2013), subspace learning with PCA and sparse representation
propagating sample distributions over time are effective to handle the for appearance model in Wang et al. (2013), a discriminative sparse
non-Gaussianity and multi-modality problems. It consists of two stages, similarity (DSS) map that combines coefficients of positive and negative
namely prediction and update. Given all available observations 𝑧1∶𝑡−1 = templates constructed in Zhuang et al. (2014), nonlocal regularized
{ }
𝑧1 , 𝑧2 , … , 𝑧𝑡−1 , the prediction stage predicts the posterior probability multi-view sparse representation in NR-MVDLSR (Kang et al., 2019),
of 𝑥𝑡 with state transition model 𝑝(𝑥𝑡 |𝑥𝑡−1 ) as follows: the joint structural sparse appearance model in SST (Zhang et al.,
2015), and metric-weighed linear representation in Li et al. (2016b).
𝑝(𝑥𝑡 |𝑧1∶𝑡−1 ) = 𝑝(𝑥𝑡 |𝑥𝑡−1 )𝑝(𝑥𝑡−1 |𝑧1∶𝑡−1 )𝑑𝑥𝑡−1 (1)

CLREST (Zhang et al., 2014b) exploited to learn a robust linear
At time 𝑡, the observation 𝑧𝑡 is available, and the state vector is
representation by solving a low-rank, consistent, and sparse learning
updated using Bayesian rule as:
problem, which corresponds to the nuclear norm, 𝓁2,1 norm, and 𝓁1
𝑝(𝑧𝑡 |𝑥𝑡 )𝑝(𝑥𝑡 |𝑧1∶𝑡−1 ) norm, respectively. More specifically, the low-rank representation can
𝑝(𝑥𝑡 |𝑧1∶𝑡 ) = (2)
𝑝(𝑧𝑡 |𝑧1∶𝑡−1 ) be implemented with nuclear norm ‖𝐙‖∗ to force a joint representation
where 𝑝(𝑧𝑡 |𝑥𝑡 ) is the observation likelihood, and 𝑝(𝑧𝑡 |𝑧1∶𝑡−1 ) denotes of particles of the target rather than independent, where 𝐙 is the repre-
the normalizing constant. The required posterior 𝑝(𝑥𝑡 |𝑧1∶𝑡 ) in Eq. (2) sentation of the corresponding observation 𝐗 of particles. Afterwards,
is approximated by a set of samples (‘particles’) {𝑥𝑖𝑡 }𝑖=1,…,𝑁 with as- the representation in the current frame 𝐗 and the previous frame 𝐗0
sociated weights 𝜔𝑖𝑡 . The candidate samples 𝑥𝑖𝑡 are drawn from an are compared using 𝓁2,1 norm denoted as ‖𝐙 − 𝐙0 ‖2,1 , in which the
importance distribution 𝑞(𝑥𝑡 |𝑥1∶𝑡−1 , 𝑧1∶𝑡 ), where a number of sampling 𝓁2,1 norm encourages the temporal consistency of the representations
algorithms can be used such as importance sampling (Isard and Blake, of tracking results, where 𝐙 and 𝐙0 are the representations of current
1998), Sequential Importance Sampling (SIS) (Doucet et al., 2001), and previous frames, separately. There is an example of this tracker
Rao-Blackwellized Particle Filter (Khan et al., 2004), and ISPF (Li et al., in Fig. 2. Due to the limitation of 𝓁𝑝 norm regularized least square
2012). optimization that does not consider the correlation between feature
In visual object tracking, to improve the robustness of trackers, dimensions, which is important for object/non-object classification
particle filters are widely used to estimate the target states from a with complicated appearance variations, Li et al. (2016b) introduced
time series state space model, which enables the trackers to deal with the metric-weighted linear representation learned with a discriminative
appearance changes effectively. Different kinds of appearance models Mahalanobis metric matrix. In addition, Chen et al. (2017b) proposed
and features can be incorporated into such framework, such as subspace a dynamically modulated mask sparse tracking method, in which the
learning (e.g., IVT, Ross et al., 2008; Sui et al., 2018), sparse representa- mask templates produced by frame difference can model the corruption
tion (e.g., L1T, Mei and Ling, 2009 and BPR-L1, Mei et al., 2011), and on target more precisely than trivial templates that utilized in early
motion energy combined with color histogram in Zhou et al. (2014), sparse representation based trackers.
etc. We summarize the classical works as follows.
2.2.3. Multi-task learning
2.2.1. Subspace learning Particle filters for tracking improve the robustness of a tracker by
In visual object tracking, the subspace representation of the target sampling a sufficient number of samples. While a dense sampling strat-
object, usually a eigenbasis computed from the singular value decom- egy generally leads to a high computational load. Besides the sparse
position, provides a compact notion of the ‘thing’ being tracked rather representation, multi-task learning has been used in MTT (Zhang et al.,
than treating the target as a set of independent pixels in an image. To 2012a), MTMVT (Hong et al., 2013), and MCPF (Zhang et al., 2017c).
adapt to appearance changes, the tracker needs to retrain the eigenbasis Fig. 3 shows the example of the structure of the learned coefficient

3
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 1. Illustration of subspace learning example in IVT (Ross et al., 2008), where MAP denotes the Maximum a Posteriori estimation used for searching the most likely particle.
𝑈 ′ , 𝜎 ′ , and 𝐼̄𝑐 are the updated basis vectors, singular values, and mean vectors, respectively. During the tracking process in frame 𝑡, a set of particles (candidate states) are drawn
from the particle filter with the dynamical model, which is implemented as a Gaussian model. For each particle 𝑖, we estimate its likelihood of observation 𝑝(𝑍𝑡𝑖 |𝑋𝑡𝑖 ). Then, the
target state can be determined by computing the MAP among all particles. Finally, the tracking result can be used to update the appearance model.

Fig. 3. Illustration for the multi-view and multi-task learning example (Hong et al.,
Fig. 2. Illustration of Low rank and sparse learning in CLREST (Zhang et al., 2014b).
2013). The 𝑣𝑖𝑒𝑤1, 𝑣𝑖𝑒𝑤2, and 𝑣𝑖𝑒𝑤3 in dictionary M represent different features. The
z⃗𝑂 ⃗𝐵𝑖 represent the target object and background templates respectively. We can
𝑖 and z row sparse matrix P and column sparse matrix Q construct the overall coefficient matrix
see that the columns of coefficients matrix 𝑍 are sparse and few dictionary templates
with respect to dictionary M.
are used for representing the particles (e.g., particle 𝑥⃗𝑖 , 𝑥⃗𝑗 , and 𝑥⃗𝑘 ).

et al. (2014) employed multiple cues including spatio-temporal motion


matrices. MCPF combines multi-task correlation filters (MCF) with par-
energy and color distributions. The motion energy can be computed
ticle filters for visual object tracking. Leveraging the MCF can shepherd
as the filter response in the format of pixel-wise rectification and
the sampled particles toward the modes of the target state distribution
summation as:
and reduce the number of particles required for tracking. Compared
with conventional correlation filters, MCPF can handle scale variation 𝐸(𝑢; 𝜃, 𝛾) = [𝐺2 (𝜃, 𝛾) ∗ 𝐼(𝑢)]2 + [𝐹2 (𝜃, 𝛾) ∗ 𝐼(𝑢)]2 (3)
via the particle sampling scheme and deal with partial occlusion via a
part-based representation. where 𝐺2 and 𝐹2 denote the three dimensional Gaussian second deriva-
tive filters and the corresponding Hilbert transform, respectively. Pa-
2.2.4. Motion energy rameters 𝜃 denote the 3D direction of the filter axis and 𝛾 is the scale
In visual object tracking, exploiting the relationships of adjacent of a Gaussian pyramid (Adelson and Bergen, 1985). During the track-
frames in a video sequence is also important to capture the dynamic ing process, the reliable information generated by motion energy can
characteristics of a moving target. Motion energy, which complements provides consistent measurements than color features. For illumination
color features, is a comprehensive descriptor for representing the spa- changes and temporal occlusions, the feature of motion energy can
tial appearance and motion characteristics (Cannons et al., 2010). Zhou improve the robustness of the tracker.

4
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Instead of using the boosting method to learn a binary classifier,


Struck (Hare et al., 2016) employed a kernelized structured output SVM
framework (Tsochantaridis et al., 2005) to learn a prediction function
for adaptive tracking. Fig. 4 shows the paradigms of MIL and Struck
trackers. Instead of using individual labeled examples, MIL trains the
classifier with positive and negative bags. Note that the Haar-like fea-
tures were used to form each weak classifier, and they can be computed
by using integral image (Viola and Jones, 2001). Hua et al. (2015)
proposed a tracking framework with proposal selection, in which the
geometric transformations of the target object are estimated to enrich
the candidate set. During online tracking, three cues, including edges,
motion boundaries, and detection scores, are combined to select the
best proposal from the candidate set. Another kind of classifier for
visual object tracking is based on deep learning, which we will discuss
in Section 5.1.1.

3.2. Trackers based on discriminative correlation filters

Fig. 4. Illustration for comparison of Struck (Hare et al., 2016) and MIL (Babenko
Recently, Correlation Filter (CF) based trackers have drawn much
et al., 2011). MIL learns the classifier with positive and negative bags, while the Stuck
avoids the labeling procedure and trains the classifier directly on the tracking output attention due to their high efficiency of computation and adaptivity.
(bounding box). The developments of DCF for tracking can be categorized into two
trends, one is the application of improved features, such as the deep
features used in CCOT (Danelljan et al., 2016), ECO (Danelljan et al.,
2.2.5. Other appearance models 2017a), and DeepSTRCF (Li et al., 2018a), the depth information used
Inspired by the primary visual cortex (area V1), Zhang et al. (2017a) in Chen et al. (2017a); the other is the theoretical innovation in
proposed a hierarchical structure of five layers appearance model and learning of filters such as MOSSE (Bolme et al., 2010), CCOT (Danelljan
performed the visual tracking within the particle filter framework. et al., 2016), CSR-DCF (Lukežič et al., 2017), ECO (Danelljan et al.,
Zhang et al. (2018d) represented the target object with the Boolean 2017a), DeepSTRCF (Li et al., 2018a), and MCPF (Zhang et al., 2017c),
maps that are generated by thresholding the HOG and color feature including the different regularization methods and optimization algo-
maps, where different granularities of Boolean maps capture multi- rithms.
scale connectivity cues of the target. In CNT (Zhang et al., 2016), a
two-layer convolutional network is utilized to extract object features 3.2.1. Single-channel filters
without offline training. However, CNT easily fails to localize the target In this section, we introduce the early DCF tracker that learned the
object when its appearance changes significantly due to motion blur or single-channel filter with single-channel features for tracking (Bolme
out-of-view. et al., 2010). In detail, our goal is to learn the correlation filter 𝐰 from
The appearance model of generative trackers plays an important an image patch 𝐱 of 𝑀 × 𝑁 pixels as in MOSSE (Bolme et al., 2010).
role, and great efforts have been put to improve the tracking perfor- All the circular shifts of 𝐱𝑚,𝑛 , (𝑚, 𝑛) ∈ {0, 1, … , 𝑀 − 1} × {0, 1, … , 𝑁 − 1}
mance. The success of deep learning in recent years, especially the are generated as training samples with Gaussian function label 𝑦(𝑚, 𝑛).
powerful representative ability, helps us to build a more robust and Then, we can find the minimizer 𝐰 by solving the following ridge
accurate appearance model. We will talk about the deep trackers in regression problem:
Section 5. ∑
min ‖𝐰𝑇 𝐱𝑖 − 𝐲𝑖 ‖2 + 𝜆‖𝐰‖2 (4)
𝐰
𝑖∈(𝑚,𝑛)
3. Discriminative trackers
where 𝜆 is the regularization parameter. The circular convolution in
For each frame of the sequence, the purpose of discriminative objective function (4) equals to the correlation operation in spatial
trackers (Wang et al., 2009; Henriques et al., 2012; Kalal et al., 2011; domain as:
Hua et al., 2015; Babenko et al., 2011; Nam and Han, 2016; Song et al.,
min ‖𝐰 ⋆ 𝐱 − 𝐲‖2 + 𝜆‖𝐰‖2 (5)
2018) is to learn a discriminative classifier that separates the target 𝐰
from the background. We summarized different types of discriminative where ⋆ denotes the circular correlation. The correlation operation
trackers according to their ways of model learning, as shown in Fig. 5. in the spatial domain equals the element-wise multiplication in the
In this section, we mainly discuss the conventional discriminative Fourier domain according to Parseval’s theorem. Therefore, the objec-
trackers that are not constructed based on deep neural networks as tive function (5) can be expressed as:
follows.
̂ 𝐱̂ ∗ − 𝐲‖
min ‖𝐰◦ ̂ 2 + 𝜆‖𝐰‖
̂ 2 (6)
𝐰̂
3.1. Trackers based on classifiers
where√ ̂ denote the Discrete Fourier Transform (𝐃𝐅𝐓) of a signal, such as
A popular kind of tracking approach is called tracking-by-detection, 𝐱̂ = 𝑇 𝐹 𝐱, and the constant matrix 𝐹 is the 𝐃𝐅𝐓 matrix. The ◦ denotes
in which a discriminative classifier is trained online to separate the the Hadamard product in the frequency domain, and ∗ denotes the
target object from the background. For example, FBT (Nguyen and complex conjugate. The filter 𝐰̂ in Eq. (6) can be computed efficiently
Smeulders, 2006) trained the foreground/background discrimination as 𝐰̂ = 𝐱̂ ∗ ◦𝐲∕(
̂ 𝐱̂ ∗ ◦𝐱̂ + 𝜆).
dynamically with texture features, MIL (Babenko et al., 2011) used the
Haar-like features (Viola and Jones, 2001) and Online-MILBoost (Zhang 3.2.2. Multi-channel filters
et al., 2006) algorithm to train a boosting classifier, and SST (Song To improve the tracking accuracy and robustness, there are works
et al., 2016) trained the classifier with the self-similarity learned fea- on multi-channel correlation filters based trackers. Different from the
tures and linear SVM. In addition, object detection has been signifi- single-channel filters that are learned with the single-channel feature
cantly improved with the development of machine learning techniques, (e.g., gray image), the multi-channel filters are learned with multi-
which further improves the performance of tracking methods. channel features (e.g., HOG features, RGB images, Haar-like features).

5
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 5. A summary of traditional discriminative trackers. This graph summarizes different kinds of traditional discriminative trackers, including classifiers, DCF based trackers, and
Spectral filter based trackers. Multi-channel filter based trackers with regularization technologies can be categorized into four kinds according to the type of regularization method.

MCCF (Kiani Galoogahi et al., 2013) extended single-channel filters and combining part-based models. We summarize these trackers as
of MOSSE (Bolme et al., 2010) to multi-channel filters, so as to be follows.
applied to multi-channel images (i.g., RGB images) or features (i.g.,
HOG). Staple (Bertinetto et al., 2016b) applied the HOG and color • Regularization Methods
features jointly to learn a model that is inherently robust to both the
In order to alleviate the boundary effects due to periodic assumption
shape deformation and color changes.
of samples, SRDCF (Danelljan et al., 2015b) and its variant Deep-
Apart from using different features, the kernel tricks (Henriques
SRDCF (Danelljan et al., 2015a) added a spatial weight function on
et al., 2015), multi-feature and multi-kernel (Tang and Feng, 2015; the filters to penalize the magnitude of the filter coefficients depending
Choi et al., 2016), scale adaptive kernel (Li and Zhu, 2014), long-term on the spatial locations. In DeepSTRCF (Li et al., 2018a), Feng et al.
tracking (Ma et al., 2015c), and multi-types of correlation filters (Ma introduced a temporal regularization term to SRDCF (Danelljan et al.,
et al., 2018) were proposed to improve the capability of DCF further. 2015b) with a single sample, in which ADMM (Boyd et al., 2011)
Henriques et al. (2015) employed kernelized correlation filters (KCF) is used to obtain a globally optimal solution instead of large linear
for real-time tracking. Tang and Feng (2015) extended KCF to multi- equations and Gauss–Seidel solver. CSCT (Fan et al., 2018) combined
kernel correlation filters (MKCF) with multiple features, in which the a dual-color clustering model and a novel spatio-temporal regulariza-
power-law (Dollár et al., 2014) is used to rescale the feature instead tion (Li et al., 2018a) to improve the robustness and discriminative
of rescaling the image patch to speed up the determination of the ability of the tracker. ASRCF (Dai et al., 2019) adaptively learned the
target scale during tracking. Inspired by the ensemble trackers (Avidan, spatial weight online based on BACF (Galoogahi et al., 2017b), which
2007; Wang and Yeung, 2014), Zhang and Suganthan (2017) proposed guides the tracker to learn more reliable filters. (See Fig. 6.)
a co-trained KCF (COKCF) tracker which consists of two KCF based sub- Conventional DCF trackers train the filters with examples generated
trackers (each with different kernel functions) to improve robustness in a small search region that contains very limited context information.
to complex backgrounding and significant appearance changes, where To obtain more real negative examples, CACF (Mueller et al., 2017) and
each of the KCF is associated with a different kind of feature (whether BACF (Galoogahi et al., 2017b) both considered the context information
the handcrafted feature or deep feature). ACFT (Ma et al., 2018) made for learning the filter. These two approaches apply different context
use of gradient-based features (HOG), intensity-based features (HOI), patch production strategies. CACF prepared the context patches before
and deep features to learn three kinds of correlation filters, including the training while BACF integrated the cropping operator into the
translation filter, scale filter, and long-term filter. objective function. However, the context patches produced by hand
Besides applying multi-channel feature maps to learn correlation context selection strategy in CACF are more limited than that in BACF.
filters, there are various techniques that can be incorporated into the In PAC (Zhang et al., 2018a), the Boolean maps (Zhang et al.,
learning process, such as learning filters in a continuous spatial domain, 2018d) and DCF with distractor-resilient metric regularization were
employing different regularizations, exploiting context information, combined to play the role of spatial selective attention and appearance

6
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 6. Visualization of the spatial regularization weights (a) in SRDCF (Danelljan et al., 2015b) and adaptive spatial regularization weights (b, c, d) in ASRCF (Dai et al., 2019),
and the corresponding image region used for training. The spatial region corresponding to the background features is assigned a large penalty in 𝑤 and vice versa. The spatial
regularization in SRDCF assigns similar values for locations that with equal distance to the sample center and the values are fixed during the tracking process, while the ASRCF
learns to change the spatial regularization weights according to different target objects.

the following problem:



𝑚
‖ ‖2 ∑ ‖
𝐷
𝐸(𝑓 ) = 𝛼𝑗 ‖𝑆𝑓 {𝑥𝑗 } − 𝑦𝑗 ‖ + ‖2 (9)
‖ ‖ ‖𝜔𝑓𝑑 ‖
𝑗=1 𝑑=1

where the coefficient 𝛼𝑗 denotes the importance of each sample 𝑥𝑗 and


𝜔 is the penalty function to regularize the correlation filters in the
continuous spatial interval [0, 𝑇 ). Fig. 7 gives the tracking procedure
based on such continuous filters.
ECO (Danelljan et al., 2017a) was proposed to reduce the high
computation cost of CCOT (Danelljan et al., 2016) and address the over-
fitting problem due to the large number of optimized parameters in
Eq. (9) above. Danelljan et al. (2017a) employed a factorized convolu-
tion approach that uses a coefficients matrix 𝑃 to construct the filter
{ 𝑑 }𝐷 ∑ { 𝑙 }𝐶
𝑓 𝑑=1 in CCOT as 𝑓 𝑑 = 𝐶 𝑙
𝑙=1 𝑝𝑑,𝑙 𝑓 , in which 𝑓 𝑙=1 is a set of base
Fig. 7. Example of continuous convolution filters for tracking (Danelljan et al., 2016),
where the multiple-resolution deep feature maps are from multiple convolutional layers. filters where 𝐶 < 𝐷. The factorized convolution operator can be defined
as:
∑ { }
𝑆𝑃 𝑓 {𝑥} = 𝑃 𝑓 ∗ 𝐽 {𝑥} = 𝑝𝑑,𝑐 𝑓 𝑐 ∗ 𝐽𝑑 𝑥𝑑 = 𝑓 ∗ 𝑃 𝑇 𝐽 {𝑥} (10)
selective attention (Ungerleider and Kastner, 2000). The spatial selec- 𝑐,𝑑

tive attention describes the target and its scene, and the appearance Then, the Fourier coefficients of new convolution operator becomes
∑𝐷
selective attention enhances the discriminative capability of the filters. 𝑆̂
𝑃 𝑓 {𝑥} is
̂𝑑 𝑑 ̂
𝑑=1 𝑃 𝑓 𝑋 𝑏𝑑 , and we can reformulate Eq. (9) in the
Fourier domain as:
• Continuous Filters
‖ ‖2 ∑‖ 𝐷
‖2
𝐸(𝑓 , 𝑃 ) = ‖𝑧̂ 𝑇 𝑃 𝑓̂ − 𝑦̂‖ 2 + ‖𝜔̂ ∗ 𝑓̂𝑑 ‖ 2 + 𝜆 ‖𝑃 ‖2𝐹 (11)
The filter learning in conventional DCF is based on the assumption ‖ ‖𝓁 ‖ ‖𝓁
𝑑=1
that all feature channels must have the same spatial resolution. Most
trackers usually utilized the resampling strategy to make sure the all where 𝑧̂ 𝑑 [𝑘] = 𝑋 𝑑 [𝑘] 𝑏̂ 𝑑 [𝑘], and 𝜆 is the weight parameter of new added
feature channels have same resolution, while such a manner could regularization term.
inevitably introduce artifacts and the feature alignment problem. In
• Part-based Methods
order to efficiently fuse multi-resolution feature maps, Danelljan et al.
proposed a novel approach called CCOT (Danelljan et al., 2016). It ex- In order to improve the robustness of trackers against occlusion, a
tended conventional DCF to learn the correlation filters in a continuous number of part-based DCF trackers have been proposed recently. Liu
spatial domain, which enables the integration of multi-resolution fea- et al. (2015) divided the target region into several parts, in which
tures in an implicit way, instead of explicitly resampling the different KCF was applied on each part of the object. DRT (Sun et al., 2018a)
feature channels to the same resolution. developed the DCF by jointly exploiting the discrimination and relia-
The goal is to learn the convolution operator 𝑆𝑓 which corresponds bility information, where the element-wise product of a base filter and
a set of convolutional filters 𝑓 = (𝑓 1 , … , 𝑓 𝐷 ) ∈ 𝐿2 (𝑇 ), where 𝑓 𝑑 ∈
a reliability term are used to construct the correlation filter. By con-
𝐿2 (𝑇 ) is the continuous filter of channel 𝑑 in a continuous interval
sidering the influence of the occlusion, Wang et al. (2018a) proposed
[0, 𝑇 ). For the 𝑑th feature channel, the interpolation operator 𝐽𝑑 ∶
an occlusion-aware part-based tracking framework that combined the
R𝑁𝑑 → 𝐿2 (𝑇 ) is defined as:
global model and part-based model. To help further understand the
𝑁𝑑 −1 ( )
∑ 𝑇 core idea of the part-based model, we show two different dividing
𝐽𝑑 {𝑡} = 𝑥𝑑 [𝑛] 𝑏𝑑 𝑡 − 𝑛 (7) strategies in Fig. 8.
𝑛=0
𝑁𝑑
where 𝑏𝑑 is the interpolation function, 𝑁𝑑 is the dimension of the 3.3. Spectral filters
𝑑th feature channel of 𝑗th sample 𝑥𝑑𝑗 ∈ R𝑁𝑑 . Thus, we define the
convolutional operation as: Similar to DCF trackers that employ various features to learn cor-

𝐷
{ } relation filters in the Fourier domain, Cui et al. (2019) represented
𝑆𝑓 {𝑥} = 𝑓 𝑑 ∗ 𝐽𝑑 𝑥𝑑 , 𝑥 ∈  (8) the pixel-wise spatial structure of the image region as an undirected
𝑑=1 weighted graph and learned the spectral filters on it. Based on graph
where ∗ denotes convolution in continuous domain. Given a set of , they proposed a Spectral Filter Tracking (SFT) tracking framework
training pairs {𝑥𝑗 , 𝑦𝑗 }𝑚
𝑗=1
the continuous filters can be learned by solving and defined a graph Laplacian operator  = 𝐷 − 𝑊 , where 𝐷 is the

7
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 8. Example of part-based_model for tracking, where 𝑎 and 𝑏 are correspond to Wang et al. (2018a) and Sun et al. (2018a), respectively. Both of them divide the image
into multiple parts with different strategy. We can see that tracking process of DRT in 𝑏 produce a meaningful reliability map that indicates the weights of different parts. The
reliability map can be used to generate more discriminative filters for tracking.

In SCM (Zhong et al., 2012), the holistic training set consisted of


the downsampled positive and negative templates was used to learn
the sparsity-based discriminative classifier (SDC). The representative
dictionary composed of the image patches of the target object in the
first frame was used to learn a sparsity-based generative model (SGM).
Then the confidence value from the SDC model and the similarity
score from the SGM model were combined through the multiplication
operation to generate the final probability map.
Ma et al. (2015b) exploited how to fuse multiple features (intensity
image, color histogram, and Haar feature) to improve tracking perfor-
mance. Given the three types of features that provide complementary
Fig. 9. Illustration showing the spectral filter for tracking (Cui et al., 2019). The information adapting to variations of the target object, the weighted
extracted deep or hand-crafted feature maps are first converted into a dense grid graph, entropy is utilized to fuse these features discriminatively. Hence, the
where each vertex in the graph is associated with a multi-channel feature vector, and weighted entropy of the combined state’s evaluation is formulated as:
the edges are constructed through the spatial adjacent relations of these vertexes. 𝜃𝑖
denotes the learned coefficients of the 𝑖th spectral filter basis. During tracking, the ∑
𝑁
responses are computed by spectral filtering on the graph of the test image. 𝐻(𝜃) = −𝑆(𝜃) 𝜔𝑖 𝑝𝑖 log 𝑝𝑖 (13)
𝑖=1

where the 𝜔𝑖 denotes the weight of 𝑖th candidate state, and the 𝑆(𝜃) =
∑ ∑
diagonal degree matrix, and 𝐷𝑖𝑖 = 𝑗 𝑊𝑖𝑗 . The normalized graph  can
1∕ 𝑖 𝜔𝑖 = 1 is utilized to normalize the sum of weighted entropy.
be computed as: MUSTer (Hong et al., 2015) applies long-term and short-term mem-
1 1 1 1
ory schemes for tracking, where the Integrated Correlation Filters (ICF)
𝑛𝑜𝑟𝑚 = 𝐷 2 𝐷− 2 = 𝐼 − 𝐷 2 𝑊 𝐷− 2 (12) are used for short-term tracking, and the local features (e.g., SIFT,
Given the input feature 𝐱, the frequency filtering on which can be Lowe, 2004) are used for keypoints matching in long-term tracking.
defined as 𝑧(𝜆 ̂ 𝑙 ), where 𝜆𝑙 is the 𝑙th spectrum of graph
̂ 𝑙 )𝑔(𝜆
̂ 𝑙 ) = 𝐱(𝜆 Zhao et al. (2016) combined the discriminative global and genera-
, and 𝑔(𝜆̂ 𝑙 ) is the spectrum filter that needed to be learned. This tive local appearance models to construct a more robust and discrim-
procedure is the same as the way of computing response in DCF. The inative tracker. The global representation was obtained by extracting
SFT operates on the local regions around a pixel (i.g., a vertex), which the feature from the color and texture of the target. The generative
can improve the robustness to local variations and occlusion. Please local model exploited the scale invariant feature transform and spatial
note that, in this method, filter function 𝑔(⊙)
̂ is approximated by the geometric information. During tracking, the global model and local
∑ model were integrated into the Bayesian approach to estimate the
Chebyshev polynomials as 𝑔(𝜆 ̂ 𝑙 ) = 𝐾−1 ̃
𝑘=0 𝜃𝑘 𝑇𝑘 (𝜆𝑙 ), where eigenvalues
{𝜆𝑙 } are scaled and shifted as 𝜆̃𝑙 = 2∕𝜆𝑚𝑎𝑥 𝜆𝑙 − 1 to make them fall in posterior probability as the Eq. (2) in Section 2.2.
[−1, 1]. (See Fig. 9.) In DLRT (Sui et al., 2018), both generative and discriminative
To summarize, DCF trackers are computationally efficient and are information are employed for target localization. The linear classifier
easy to be integrated with other models. However, DCF trackers have learned in the subspace is used to distinguish the target from the
the following limitations: (i) they tend to be affected by boundary neighboring background patches. The observation model in DLRT (Sui
effects, background clutter, and occlusion; (ii) they have constrained et al., 2018) for target localization is formulated as:
{ }
capacity for target scale estimating under severe non-rigid deformation; 1
𝑝(c|st ) ∝ exp − (‖1 − g(c; 𝐏, 𝐰, b)‖ + 𝜌𝛿(c; 𝐏)) (14)
(iii) degraded tracking speed when combined with the high dimen- 𝑙
sional deep features. which denotes the likelihood of a candidate c to be the target given
the state st , where 𝑔(c; ⋅) denotes the normalized reliability with linear
4. Collaborative trackers classifier defined by (𝐰, b). 𝛿(c; ⋅) denotes the reconstruction error with
the learned subspace 𝐏 in generative view.
From the previous sections, most trackers belong to either gener-
ative models or discriminative models. Yu et al. (2008), Zhong et al. 5. Trackers based on deep learning
(2012), Ma et al. (2015b), and Zhao et al. (2016) explore the fusion
strategies to combine the generative model and discriminative model With the growing popularity of using deep learning in a wide range
into a stronger tracker. of computer vision tasks, some researchers focus their attention on

8
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

appearance variations in the temporal domain. The generator is used


to predict a weight mask M for identifying the discriminative features.
The entropy distribution of testing frames indicates that the classifier
trained with adversarial learning tends to focus on more robust fea-
tures, compared with the entropy distribution of the classifier trained
without adversarial learning. The trackers like MDNet (Nam and Han,
2016) and VITAL are constrained by the specific datasets or domains,
and the training time in the pretraining process grows significantly with
the increase of branch number (number of sequences in the training
dataset). Although there is no need pretraining process with external
datasets in BranchOut compared to MDNet, multiple branches increase
the computational complexity for both tracking and model update.
Fig. 10. The flowchart of multi-domain learning for tracking (Nam and Han, 2016;
Jung et al., 2018), in which the red and yellow dotted lines correspond to the flows of
MDNet and RT-MDNet (fast version of MDNet), respectively. The yellow and blue grids 5.1.2. Siamese network based trackers
in domain-specific layers means that there are two classes in each of the branches. Siamese architecture is a general framework that was first de-
scribed in Bromley et al. (1994) and used for signature verification.
The Siamese network has two identical sub-networks to compare two
developing trackers with deep neural networks. In deep learning based patterns, which takes two signals as inputs and outputs the similarity
trackers, different neural networks can be used for feature extraction, of two patterns. Similarly, the Siamese network based trackers aim to
object detection, similarity measuring, and state estimation. Recently, learn an appearance model that can maximize the distance between
trackers based on deep learning such as SiamFC (Bertinetto et al., two patches of different objects and minimize the distance between two
2016c), EAST (Huang et al., 2017a), CFNet (Valmadre et al., 2017), patches of the same object. Fig. 11 illustrates this kind of trackers. We
MDNet (Nam and Han, 2016), HDT (Qi et al., 2016), BranchOut (Han summarize the variety of deep trackers based on the Siamese network
as follows.
et al., 2017), DaSiamRPN (Zhu et al., 2018a), SiamRPN++ (Li et al.,
SINT (Tao et al., 2016) is one of the early attempts to learn a
2019b), ATOM (Danelljan et al., 2019), and DiMP50 (Bhat et al., 2019)
matching function with the Siamese network in the offline phase. They
have proved that they outperform the traditional trackers by a large
designed a Siamese architecture that consists of two identical convo-
margin, in terms of accuracy, robustness and speed. In this section,
lutional neural networks. In deep convolutional networks, the higher
we first overview the mainstream deep trackers based on different
layers capture high-level features whereas the lower layers extract the
network structures, then we analyze the training approaches used in
low-level visual features. Therefore, SINT (Tao et al., 2016) applied
deep trackers.
multiple layers as the object’s representation to adapt to the target’s
appearance variations. However, SINT (Tao et al., 2016) achieved the
5.1. Trackers based on different network structure
tracking speed of 2 fps, which is far from being real-time. EAST (Huang
et al., 2017a) exploited a decision-making agent learned by reinforce-
5.1.1. Classifiers based trackers
ment learning (Caicedo and Lazebnik, 2015; Mnih et al., 2015) that
Similar to traditional trackers based on tracking-by-detection, deep
can dynamically decide whether to locate the target on early layers to
neural networks provide the possibility to learn the feature representa-
speed up the tracking process.
tion and train the classifier for visual object tracking in an end-to-end
Recently, SiamFC (Bertinetto et al., 2016c) and SiamRPN (Li et al.,
manner. Recently, MDNet (Nam and Han, 2016) combined multi-
2018c) achieve high tracking speed beyond real-time due to their fully-
domain learning (Dredze et al., 2010; Duan et al., 2009) and convo- convolutional Siamese network. The structural relationships
lutional neural networks (CNN) to learn a shared representation model among the local patterns of the target and local pattern detection
for object tracking, which consists of shared layers and 𝐾 branches for module were exploited by Zhang et al. (2018e) to deal with non-rigid
the last fully-connected layers, corresponding to 𝐾 domains. In Bran- appearance change, rotation, partial occlusion, and motion blur. While
chOut (Han et al., 2017), Han et al. employed multiple branches with these trackers do not update the appearance model online, which is
fully-connected layers on shared convolutional layers. Each branch critical for drastic appearance changes during tracking. To improve
may have different depths. In addition, BranchOut utilizes stochastic the generalization capability of SiamFC (Bertinetto et al., 2016c), He
ensemble learning for the model update, in which it selects a subset of et al. (2018) proposed a twofold Siamese network (SA-SIAM) which
branches randomly for the model update. Based on MDNet (Nam and contains an appearance branch (A-Net) and a semantic branch (S-
Han, 2016), Wang et al. (2018b) proposed a structure-aware network Net). CFNet (Valmadre et al., 2017) is proposed for joint learning
to capture the relationships among training samples. The visual and of deep features and DCF tracker in an end-to-end manner. Unlike
natural language cues are combined to shepherd the global proposal SiamFC (Bertinetto et al., 2016c) that only compares the following
generation network toward the locations that are most likely to contain frames with the initial template of the first frame, CFNet computes
the target. To speed up the procedure of feature extraction, Jung et al. a new template in the following frames and combines it with the
(2018) utilized the RoIAlign (He et al., 2017) module to generate the previous template in a moving average. While CFNet cannot achieve
candidates from the feature map. (See Fig. 10.) competitive accuracy compared to the advanced DCF trackers such as
Generative adversarial learning has attracted much attention from CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a).
both deep learning and computer vision communities. It has been By combining the Siamese network, DCFNet (Wang et al., 2017),
successfully integrated into multiple applications such as face syn- and Spatial Transformer Networks (STN) (Jaderberg et al., 2015),
thesis (Huang et al., 2017b), image-to-image translation (Isola et al., Zhang et al. (2018f) constructed a network with spatially aligned
2017), and image super-resolution (Ledig et al., 2017). The key idea is correlation filters. The spatial alignment module that estimates the
to learn deep representations without extensively annotated training global transformation of a target in two consecutive frames provides the
data (Creswell et al., 2018). It consists of two models: a generator coarse localization of the target. Then, the correlation filters perform
𝐺 tries to capture the real data distribution, and a discriminator 𝐷 the fine-grained localization on the transformed image. These two
distinguishes the images that are either from the training data or from complementary components work together to address the issues such as
the generator. For example, VITAL (Song et al., 2018) employed ad- boundary effects and aspect ratio variations, to achieve promising per-
versarial learning to augment training samples, which not only exploits formance against the BACF (Galoogahi et al., 2017b) and ECO (Danell-
the most robust features over a long temporal span but also captures jan et al., 2017a).

9
F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 11. The flowchart of Siamese network based learning for tracking, where ⋆ denotes the correlation operation and BBR represents the Bounding Box Regression. In this kind
of tracker, the shared backbone network (e.g., VggNet, ResNet, and GoogLeNet) is used to extract feature representation for both the template and search branches. Then, the
response is calculated by the correlation operation between the features of the two branches. Most recent deep Siamese network based trackers (e.g., SiamRPN++, SiamBAN) employ
additional classification and BBR sub-networks to improve the robustness and accuracy of the tracking result. The final tracking result is determined by selecting the bounding
box with the highest classification score. Other ways of information interaction can also be applied such as global matching between graphs in Guo et al. (2021), cross-attention
in Chen et al. (2021) and Wang et al. (2021b).
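As a complement to Fig. 11, the following is a minimal PyTorch sketch of the correlation operation (the ⋆ in the figure) in its depthwise form, where the template feature acts as a per-channel convolution kernel slid over the search feature. Tensor shapes and the function name are illustrative assumptions, not the implementation of any particular tracker.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depthwise cross-correlation: each channel of the template is correlated
    with the corresponding channel of the search feature (the star in Fig. 11)."""
    b, c, h, w = search_feat.shape
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

z = torch.randn(1, 256, 6, 6)      # template-branch features
x = torch.randn(1, 256, 26, 26)    # search-branch features
response = depthwise_xcorr(x, z)   # (1, 256, 21, 21) response map
```

Classification and bounding box regression heads are then applied to such a response map to produce the final tracking result.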

Fig. 12. Illustration of the tracking procedure in DiMP (Bhat et al., 2019). The tracker learns the target model 𝑓 from a set of feature maps of the training set collected from previous frames. The model optimizer is implemented as a steepest-descent method that needs only a few optimization steps to find the optimal model.

In recent years, many efforts have been devoted to improving the feature representation in deep trackers, especially those based on Siamese networks such as SiamRPN (Li et al., 2018c), which takes advantage of the Region Proposal Network (RPN) to produce accurate bounding boxes. DaSiamRPN (Zhu et al., 2018a) introduced a distractor-aware module to enhance the discriminative ability of the tracker by taking potential distractors into the embedding space learning process. Dong and Shen (2018) utilized a triplet loss on the Siamese network instead of the pairwise loss in SiamFC (Bertinetto et al., 2016c) to learn more powerful features, which takes full advantage of the positive and negative scores through a matching probability; nevertheless, its performance still cannot compete with top DCF trackers such as ECO (Danelljan et al., 2017a).

In addition, SiamRPN++ (Li et al., 2019b), SiamDW (Zhang and Peng, 2019), and MFFSiam (Yang et al., 2019a) exploit deeper backbone networks such as ResNet (He et al., 2016) to extract features. SiamRPN++ replaces the Up-Channel Cross-Correlation (UP-XCorr) layer in SiamRPN with a lightweight Depthwise Cross-Correlation (DW-XCorr) layer. Moreover, aggregating multi-layer features further improves the accuracy of the tracker. In SiamDW, a new module named the cropping-inside residual (CIR) unit was designed, which helps to construct deeper and wider backbone networks. In MFFSiam, Yang et al. (2019a) proposed a feature agglomeration module to fuse multi-level feature maps, integrating the deeper network into the Siamese framework. Voigtlaender et al. (2020) utilized Faster R-CNN (Ren et al., 2015) as the backbone network and replaced the category-specific detection head with a re-detection head to classify the regions produced by the RPN.

During tracking, DiMP dynamically learns the target model with samples collected from historical frames, as shown in Fig. 12. PrDiMP50 (Danelljan et al., 2020) combines the advantages of ATOM (Danelljan et al., 2019) (overlap maximization by an IoU predictor) and DiMP50 (Bhat et al., 2019) (predicting a discriminative model), and proposes a probabilistic regression model to locate the target center more accurately. In Ocean (Zhang et al., 2020), a feature alignment module was proposed to move the sampling positions of the convolution from a fixed region to the whole predicted region (e.g., the bounding box). This kind of module provides a global feature for distinguishing the target from the background. Besides the classification and regression branches, both SiamCAR (Guo et al., 2020) and SiamFC++ (Xu et al., 2020) add a centerness branch in parallel with the classification branch to estimate the quality of each bounding box, in which the regression target of the centerness score for a ground truth bounding box 𝑡̃ = (𝑙̃, 𝑡̃, 𝑟̃, 𝑏̃) is defined as:

C(i, j) = \frac{\min(\tilde{l}, \tilde{r})}{\max(\tilde{l}, \tilde{r})} \times \frac{\min(\tilde{t}, \tilde{b})}{\max(\tilde{t}, \tilde{b})}    (15)
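The centerness target of Eq. (15) can be computed in a few lines. Below is a minimal NumPy sketch of the equation exactly as written above; the function name and arguments are illustrative, and note that some detectors and trackers additionally take the square root of this product, which is not shown here.

```python
import numpy as np

def centerness_target(l, t, r, b):
    """Centerness score of Eq. (15) for a location whose distances to the
    ground-truth box edges are (l, t, r, b); inputs are positive floats or
    arrays of the same shape."""
    return (np.minimum(l, r) / np.maximum(l, r)) * (np.minimum(t, b) / np.maximum(t, b))
```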
5.1.3. Ensemble trackers

Most existing trackers have shown significant improvements in tracking performance, but most of them address specific problems such as robustness, occlusion, or the tracker's discriminative ability. Ensemble learning is a machine learning paradigm in which multiple learners are trained jointly to solve the same problem. Some ensemble trackers employ this principle to construct a robust and strong tracker by combining a set of weak trackers.

HDT (Qi et al., 2016) is an ensemble tracker that aims to deploy hierarchical CNN features. An adaptive online decision learning algorithm was used to hedge the weak trackers into a stronger one, as shown in Fig. 13. Given the test image 𝑋^𝑖, the corresponding feature map of the 𝑘th layer is 𝑆_𝑘^𝑖, and the response of the 𝑘th weak tracker is 𝐿_𝑘 = 𝐹^{-1}(𝐹(𝑆_𝑘^𝑖) ⋅ 𝜔_𝑘), where 𝐹 and 𝐹^{-1} denote the DFT operation and its inverse, respectively, and 𝜔_𝑘 is the 𝑘th filter in the Fourier domain. Then the final target location can be computed with a parameter-free Hedge algorithm as:

(x_t^*, y_t^*) = \sum_{k=1}^{K} w_t^k \cdot (x_t^k, y_t^k)    (16)

where 𝑤_𝑡^𝑘 and (𝑥_𝑡^𝑘, 𝑦_𝑡^𝑘) are the weight and location of the 𝑘th expert, respectively. Although it applies multi-layer deep features to construct the base trackers, HDT adopts a simple CF model for tracking.
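The fusion step of Eq. (16) is a plain weighted average of the expert predictions. The following is a minimal NumPy sketch under the assumption that the expert weights have already been produced by the Hedge algorithm (which is not shown); names are illustrative rather than HDT's actual code.

```python
import numpy as np

def fuse_expert_locations(weights, locations):
    """Weighted fusion of Eq. (16): `weights` has shape (K,) and `locations`
    has shape (K, 2), holding the (x, y) predicted by each of K weak trackers."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # assume weights should sum to one
    return (weights[:, None] * np.asarray(locations, dtype=float)).sum(axis=0)
```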
To improve the accuracy of localization, the bounding box regression technique (Girshick et al., 2014) was used in deep ensemble trackers (Wang et al., 2016; Han et al., 2017) to predict a new detection window given the deep features.


Fig. 13. The flowchart of ensemble learning for tracking (Qi et al., 2016). The feature maps from different convolution layers construct the weak trackers. During tracking, the responses of the weak trackers are weighted and summed to predict the target position, where the weight of each weak tracker is updated by the hedging algorithm.

Fig. 14. Illustration of RNNs for modeling the structure of the target object (Fan and Ling, 2017b), where (e), (f), (g), and (h) are the four directed acyclic graphs associated with different RNNs.

Fig. 15. An example of the conv-LSTM module for learning the target filter in Yang and Chan (2017). The conv-LSTM takes the exemplar feature 𝑒_𝑡 as input and produces the target object filter 𝑓_𝑡 for tracking in the next frame. Both the initial hidden state ℎ_0 and the cell state 𝑐_0 are produced by feeding the initial exemplar feature map 𝑒_0 of the first frame into a convolutional layer.

The structure of these trackers shares the motivation that the component trackers are based on an identical basic framework. For example, the branches in BranchOut (Han et al., 2017) share the same convolutional layers, and each of them is followed by one or more fully connected layers. A common problem of trackers based on deep CNNs is the scarcity of training samples. STCT (Wang et al., 2016) attempted to transfer a pre-trained deep model to online tracking through a sequential training method, in which each channel of the output feature map was treated as a base learner and trained online. These base learners were sequentially selected into the ensemble set via an importance sampling strategy. Moreover, instead of exploiting scale-space tracking on the feature pyramid (Danelljan et al., 2014), STCT used a scale prediction network (SPN) trained with deep features to predict the current scale.

TSN (Teng et al., 2017) consists of three sub-networks: a Feature Net, a Temporal Net, and a Spatial Net. The Feature Net is used to extract the general features of the target, and the Temporal Net is designed to exploit the temporal and global information of previous frames by tuple learning. More specifically, the Temporal Net selects a tuple of key target models from historical frames and predicts the similarity between current proposals and the tuple. During tracking, the Temporal Net generates several similar candidates for the Spatial Net. Since it is not sufficient to consider only the global information captured by the Temporal Net, the Spatial Net takes features from lower layers to generate a more accurate confidence map.

5.1.4. RNN based trackers

Recurrent Neural Networks (RNNs) have shown great performance in processing sequential data such as speech recognition, video analysis, machine translation, and image captioning. The Long Short-Term Memory (LSTM) network was first proposed by Hochreiter and Schmidhuber (1997), aiming to address the gradient vanishing problem in RNNs. Some works attempted to address the tracking problem by taking advantage of RNNs in dealing with sequential data. In visual object tracking, model drifting is a long-standing problem due to the accumulation and propagation of estimation errors. Cui et al. (2016) proposed a Recurrently Target-attending Tracking (RTT) method, which employs multi-directional RNNs to model the spatial relationship between the target and the background. However, RTT only uses the HOG feature and does not consider the global appearance model, which is also critical for dealing with target deformation. The contamination from partial occlusion or local appearance variation can be alleviated by using quad-directional RNNs to traverse the spatial candidate region from four angles. Similarly, the Structure-Aware Network (SANet) (Fan and Ling, 2017b) deploys multi-directional RNNs to model the self-structure of the target object and incorporates such structure information into CNNs to improve robustness. Fig. 14 illustrates an example of RNNs that model the self-structure of the object, in which the topology of an undirected cyclic graph is divided into four directional RNNs. It is also noteworthy that they adopt short-term and long-term schemes jointly to update the model, as in Nam and Han (2016).

Yang and Chan (2017) proposed a recurrent filter learning (RFL) method that applies a convolutional LSTM for the model update; its structure is shown in Fig. 15. Similar to the Siamese network, RFL consists of two convolutional networks (i.e., E-CNN and S-CNN), corresponding to the exemplar and search image patches, respectively. The weights of the LSTM are implemented as convolutional filters, which helps capture the spatial relationships between features. However, RFL still fails to bring a significant improvement in performance due to its update scheme. MemTrack (Yang and Chan, 2018) employed an LSTM as a memory controller to control the memory reading and writing, which is more effective than RFL (Yang and Chan, 2017).
5.1.4. RNN based trackers salient features to dynamically come to the forefront.
Recurrent Neural Networks (RNNs) have shown great performance Kosiorek et al. (2017) combined the LSTM model with two attention
for processing sequential data such as speech recognition, video analy- mechanisms (i.e. spatial attention and appearance attention) for visual
sis, machine translation, and image captioning. Long-Short Term Mem- tracking, in which the spatial attention is used to select the region as
ory (LSTM) network was first proposed by Hochreiter and Schmidhuber the attention glimpse and the appearance attention is used to suppress
(1997) in 1997, which aims to address the gradient vanishing problem distractors around the target object. Inherently, the attention mecha-
in RNNs. Some works attempted to address the tracking problem by nism used here aims to select the most region we are interested in and
tracking the advantages of RNNs in dealing with sequential data. In discard the cluttering background regions.
visual object tracking, the model drifting phenomenon is a long-term SCT (Choi et al., 2016) learns attentional feature based correlation
plaguing problem due to the accumulation and propagation of estima- filters that focus on distinctive attentional features. The novelty of
tion error. Cui et al. (2016) proposed a Recurrently Target-attending the framework is applying different kinds of features and kernels in
Tracking (RTT) method, which employs multi-directional Recurrent two stages: disintegration and integration. In the disintegration stage,
Neural Networks (RNNs) to model the spatial relationship between multiple attentional correlation filters (AtCFs) were trained with var-
the target and the background. However, RTT only uses the HOG ious feature types and kernel types. In the integration stage, the final
feature and does not consider the global appearance model which is response was aggregated by the priority and reliability of each response
also critical for dealing with target deformation. The contamination of AtCFs. The attentional weight map for a specific kind of feature was


Fig. 16. Illustration of the heat maps of residual attention (Wang et al., 2018). The residual attention is implemented as an hourglass-like network (Newell et al., 2016). The general attention map is similar to a Gaussian distribution.

Fig. 17. An example of spatial and channel attention (Yang et al., 2019b), where GMP denotes the global max-pooling operation.

The attentional weight map for a specific kind of feature is estimated by an attentional weight estimator trained with a partially growing decision tree (PGDT). Choi et al. (2017) proposed a deep attentional network to select a set of filters with more diversity. Moreover, a large variety of tracking modules covers more appearance and dynamic changes of the target.

RASNet (Wang et al., 2018) incorporates three kinds of attention mechanisms (i.e., general attention, residual attention, and channel attention) into a Siamese network to boost the discriminative and adaptive ability of the tracker. The purposes of the three parts are to capture common characteristics, learn the distinctions of targets in different videos, and reflect the channel-wise quality of features, respectively (see Fig. 16 for the residual attention and general attention).

Optical flow (Horn and Schunck, 1981; Dosovitskiy et al., 2015) is an important technology in motion estimation, which describes the relative motion of each pixel and can be used for multiple visual tasks such as pose estimation, object recognition, visual tracking, and video recognition. FlowTrack (Zhu et al., 2018b) deployed a spatial–temporal attention module that employs the flow information to learn correlation filters. The cosine similarity metric measures the similarity between the flow-warped features and the features of the specified 𝑡−1 frame. First, the spatial attention weights are computed by cosine similarity and a softmax operation. Then, the weights of the spatial attention module are re-calibrated by temporal attention to generate the output of the temporal attention module. To employ multi-level information of the target object, Chen et al. (2019) designed a multiple attention module (MAM) to guide the candidate sampling procedure, including temporal attention, spatial attention, channel-wise attention, and layer-wise attention. One limitation of MAM is that the high computation cost caused by the use of multiple fully connected layers degrades the tracking speed.

In addition to spatial and channel-wise attention, as shown in Fig. 17, HASiam (Yang et al., 2019b) also employed a non-local attention module to exploit the structural dependencies over local features. In SiamAttn (Yu et al., 2020), the self-attention captures context information via spatial attention and selectively enhances interdependent channel-wise features with channel attention, while the cross-attention is used for aggregating the contextual interdependencies between the target template and the search image. In CGACD (Du et al., 2020), the spatial attention module and the channel attention module are obtained from pixel-wise correlation and channel-wise correlation in the second stage to enhance the features of the RoI generated by the first (Siamese tracking) stage. In this way, CGACD improves the accuracy of corner detection and the estimation of the final bounding box. Similar to the non-local self-attention that captures longer-range dependencies in HASiam (Yang et al., 2019b), STMTrack (Fu et al., 2021) computes the attention between the features of a set of historical templates and the current query frame. Then the weighted target feature is concatenated with the query feature to generate the aggregated feature for classification and bounding box regression.

The Transformer, as a special kind of attention mechanism, was first proposed by Vaswani et al. (2017) for machine translation. The basic blocks in the Transformer are the self-attention and cross-attention modules, stacked in the encoder and decoder. Recently, the Transformer has been applied to multiple vision tasks, such as object detection in DETR (Carion et al., 2020) and image generation in the Image Transformer (Parmar et al., 2018). Wang et al. (2021b) and Chen et al. (2021) applied the Transformer to model the relationship between the template and search patches. In TransT (Chen et al., 2021), the self-attention and cross-attention modules of the Transformer are jointly utilized to predict the target location instead of the correlation operation used in most Siamese network based trackers. Wang et al. (2021b) separately applied the encoder and decoder to the two branches of the Siamese network to generate more discriminative features. The encoder in the template branch receives a set of templates from previous frames and aggregates them into a more informative new template for tracking. Fig. 18 illustrates the architectures of the non-local attention and the two Transformers. The tracking performance demonstrates that Siamese network based trackers equipped with the Transformer module benefit in both the template and search branches.
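As a small illustration of how cross-attention can replace the correlation operation for template–search interaction, the following PyTorch sketch lets the search tokens attend to the template tokens. The feature dimension, head count, and token shapes are assumptions for illustration and do not correspond to the exact configuration of TransT or the other trackers above.

```python
import torch
import torch.nn as nn

# Minimal cross-attention between flattened template and search features.
dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

z = torch.randn(1, 6 * 6, dim)      # template tokens (from a 6x6 feature map)
x = torch.randn(1, 26 * 26, dim)    # search-region tokens (from a 26x26 feature map)

# Search tokens query the template; the output keeps the search resolution and
# can be fed to classification and bounding-box regression heads.
fused, _ = cross_attn(query=x, key=z, value=z)
print(fused.shape)                  # torch.Size([1, 676, 256])
```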
In SiamGAT (Guo et al., 2021), the feature maps of the two branches of the Siamese network are modeled as two subgraphs for target information embedding by a graph attention module (GAM). Unlike the global matching in cross-correlation, the GAM module exploits the part-to-part correspondence between the target template and the search region, which aims to adapt to shape and pose variations of the target object.

5.1.6. Reinforcement learning based trackers

The core idea of Reinforcement Learning (RL) is to learn a sequence of policies that decide actions to maximize a numerical reward signal (Sutton and Barto, 1998). Deep Q-Network (Mnih et al., 2013) and policy gradient methods (Sutton et al., 2000) are two representative algorithms in reinforcement learning. The deep Q-network presented in Mnih et al. (2015) combines Q-learning with deep neural networks to approximate an action-value function. ADNet (Yun et al., 2017) is the first work that utilizes RL for visual tracking. Different from the dense candidate sampling strategy in MDNet (Nam and Han, 2016), ADNet (Yun et al., 2017) applied reinforcement learning with a policy gradient method to train an Action-Decision network, which needs fewer search steps to find the target location. A Markov Decision Process (MDP) is exploited to model the tracking process, where the MDP is defined by a state space 𝒮, an action space 𝒜, a state transition function 𝑠_𝑡 = 𝑓(𝑠_{𝑡−1}, 𝑎), and a reward function 𝑅(𝑠, 𝑎), where 𝑠 ∈ 𝒮 and 𝑎 ∈ 𝒜. The next state 𝑠_𝑡 is obtained by the state transition function. The action-based tracking approach helps to reduce the computational complexity of tracking and can work well even when the training data are only partially labeled.

Ren et al. (2018) proposed a deep reinforcement learning with iterative shift (DRL-IS) method for visual tracking. They adopted an Actor-Critic (Konda and Tsitsiklis, 2000) network (see Fig. 19 for details) to predict the object motion and select actions based on the tracking status, whereas previous trackers such as ADNet (Yun et al., 2017), EAST (Huang et al., 2017a), and POMDP (Supančič III


Fig. 18. Examples of non-local attention modules, where (a) is the non-local attention used in HASiam (Yang et al., 2019b), and (b) and (c) refer to the Transformer modules for fusing the template and search patches in Chen et al. (2021) and Wang et al. (2021b), respectively. The difference between (b) and (c) is that the latter encodes a history of template features and their Gaussian-shaped masks in the decoder module, where the masks are used for the Mask Transformation (top-right in (c)) and the Feature Transformation (top-left in (c)).

Such an objective function is different from the conventional Siamese


networks, which apply the same model 𝜙(𝑥; 𝑊 ) with shared weights 𝑊
to both 𝑥𝑖 and 𝑧𝑖 . The experimental results in Bertinetto et al. (2016a)
have validated the effectiveness of the Siamese learnet module.
More general, Park and Berg (2018) studied how to learn the target
model that generalizes well over future frames through meta-learning.
The meta-training algorithm can be integrated into many deep trackers
such as MDNet (Nam and Han, 2016), CREST (Song et al., 2017), and
SiamFC (Bertinetto et al., 2016c), showing significant improvement
in accuracy and robustness with fewer iterations for initialization.
To obtain a robust initial target model, the meta-training utilizes the
larger-scale video detection dataset (Russakovsky et al., 2015) for pre-
Fig. 19. The example of deep reinforcement learning for tracking (Ren et al., 2018).
training, which is different from MDNet and CREST. In MLT (Choi
The actor network takes the concatenation (1024-dim) of the feature of a candidate
box 𝑓 and the current target feature 𝑓 ∗ as input and predicts one of the four et al., 2019), a meta-learner network is proposed to provide the Siamese
actions (Continue, Stop & update, Stop & ignore, Restart), and the prediction network network the target-specific information to adapt to the new appearance
produces the shift vector 𝛿 = {▵𝑥 , ▵𝑦 , ▵𝑤 , ▵ℎ }. Combining the actions, shift 𝛿, and the of the target.
concatenated features as input (1032-dim), the critic network estimate the Q-value and
Huang et al. (2019) and Wang et al. (2020) both employed a meta-
fine-tune the prediction network 𝜙 and actor network 𝜃.
learner (e.g., MAML, Finn et al., 2017) to fast convert the general
object detector into an tracker. The former utilized the anchor-based
detectors (e.g., SSD and Faster R-CNN, Ren et al., 2015), and the latter
and Ramanan, 2017) make decisions on tracking status or estimate the applied the anchor-free detector (e.g., FCOS, Tian et al., 2020) and
target motion in a separate way. The limitation of ADNet (Yun et al., anchor-based detector (e.g., RetinaNet, Lin et al., 2017). During online
2017) and DRL-IS (Ren et al., 2018) is that the tracking speed is still far tracking, the target model is continuously updated with fewer steps.
from real-time due to the requirement of many iterative steps in each We show the training process of Wang et al. (2020) in Fig. 20 for more
frame. Recently, by performing in continuous action space, Chen et al. detail of the meta-learning for tracking.
(2018) proposed an ACT tracker based on the ‘Actor-Critic’ framework
that runs at 30 fps and obtains comparable performance. Compared
5.1.8. Learning weights with deep neural networks
with ADNet, ACT requires only one action generated by ‘Actor’ to locate
To investigate the merits of correlation filters (CFs) in visual track-
the target in every frame.
ing, existing studies learn correlation filters via deep neural networks,
such as CFNet (Valmadre et al., 2017), LSART (Sun et al., 2018b),
5.1.7. Few-shot learning for tracking FlowTr (Zhu et al., 2018b), and RTINet (Yao et al., 2018). CFNet (Val-
The starting point of few-shot learning is to learn a new concept madre et al., 2017) and FlowTr (Zhu et al., 2018b) formulated the
with a few examples (few-shot learning) or a single example (one-shot correlation filters as a layer in neural networks that can be trained by
learning). During online tracking, we only know the bounding box of back-propagation. LSART (Sun et al., 2018b) applied the complemen-
the target in the initial frame and there are few examples available tary kernelized ridge regression (KRR) model and spatially regularized
for training the model. An early work by Bertinetto et al. (2016a) CNN model to improve the capacity of the tracker, which corresponds
constructed the learnet in a one-shot learning manner to directly pre- to the holistic information and local regions, respectively. The weights
dict the parameters of a pupil network. The learning process can be in KRR and CNN models were learned by deep neural networks. To
formulated as the following objective function: exploit the improvement on advanced DCF trackers, Yao et al. (2018)
introduced RTINet that combined the learning of deep representation
1∑ ( (
𝑛
( )) )
min′  𝜙 𝑥𝑖 ; 𝜔 𝑧𝑖 ; 𝑊 ′ , 𝓁𝑖 (17) and correlation filters (i.g., BACF, Galoogahi et al., 2017b). They imple-
𝑊 𝑛 𝑖=1
mented the CFs learning in BACF with an updater network by unrolling
where 𝑊 ′ is the parameters we need to learn given the training set the ADMM algorithm. In addition, DSLT (Lu et al., 2018) learns the
consists of triplets ⟨𝑥𝑖 , 𝑧𝑖 , 𝓁𝑖 ⟩. 𝜔 is the function that maps (𝑧𝑖 ; 𝑊 ′ ) to 𝑊 weights of the regression network with a shrinkage loss directly, which
in standard discriminative learning form, 𝜙 is the predictor function. can be trained with the Adam (Kingma and Ba, 2021) optimizer.


5.2.3. Online learning methods


Most Siamese network based trackers exhibit excellent performance
such as the high tracking speed and accurate state estimation due to
their offline training without online updates (e.g., SiamFC, Bertinetto
et al., 2016c) or simple linear template updates (e.g., DaSiamRPN, Zhu
et al., 2018a). These mechanisms, in turn, become the limitations when
dealing with more challenging scenarios. In visual object tracking,
online learning aims to learn the model online to adapt to target varia-
tions. DSiam (Guo et al., 2017) performs online learning of two trans-
formations matrices to capture the target appearance variation infor-
mation and enable background suppression. Moreover, DiMP50 (Bhat
Fig. 20. The example of meta learning for tracking (Wang et al., 2020). In one et al., 2019) predicts the target model online with an optimized gradi-
iteration, the losses on training and test sets are calculated. The parameters 𝜃𝑘 are ent descent algorithm to adapt to the appearance variations. Recently,
updated in every training step as shown in the upper flow, where the meta parameters Meta-learning provides a way to quickly learn a new concept using
𝜃0 are updated with the accumulated meta-gradients of all testing images as shown in
a few examples and training iterations, especially the learning of the
the bottom flow. The meta-learning method in Huang et al. (2019) shares a similar
spirit in the above process. model updater in the tracking area. During the online tracking, Li
et al. (2019c) utilized the meta-learning framework to learn the target
model by a recurrent model updater, in which the long-term target
information is aggregated by a ConvGRU module.
5.2. Different training strategies in deep trackers
DSLT (Lu et al., 2018) incrementally update the regression network
frame-by-frame during tracking. In DCFST (Zheng et al., 2020), the
In deep learning based trackers, there are three different ways ridge regression solver was integrated into the CNNs and a shrinkage
the trackers utilized to build the appearance model according to the loss (Lu et al., 2018) was used for fast training of the target model.
utilization and update of the CNN model. In this section, we discuss During online tracking, the appearance model was updated linearly
three different training strategies for learning the appearance model and the target model was solved by the Gauss–Seidel method. To
including: (1) Using the Pre-trained CNN Model; (2) Offline Learning improve the representation diversity, BranchOut (Han et al., 2017)
Methods; (3) Online Learning Methods. The main difference between utilizes a stochastic ensemble learning strategy to selectively update
the offline learning and online learning is a tracker whether or not to the fully-connected layers.
update its backbone network or target model during tracking. In most
scenarios, the deep trackers with online update strategies also need the 6. Evaluation methodologies
pretraining on external datasets.
6.1. Precision plot

5.2.1. Using the pre-trained CNN model The average center location error (CLE) is the pixel-wise average
A common way in early deep trackers is to directly use the deep fea- Euclidean distance between the center locations of the target and the
tures extracted from a pre-trained CNN model on other tasks (e.g., im- manually labeled ground truth bounding boxes. Given the target’s
age classification). For example, the DeepSTRCF (Li et al., 2018a), ground truth location 𝑠𝑖 and the estimated location 𝑠̂𝑖 , the average
DeepSRDCF (Danelljan et al., 2015a), and CCOT (Danelljan et al., location error can be computed by:
2016) directly use multiple deep features extracted from a pre-trained
1 ∑
𝑁
model like VGG-m network (Chatfield et al., 2014), and combine 𝐶𝐿𝐸 = ‖𝑠 − 𝑠̂𝑖 ‖ (18)
𝑁 𝑖=1 𝑖
them with the hand-crafted features to represent the target, which can
significantly improve the tracking performance. where 𝑁 is the length of a sequence, ‖ ⋅ ‖ is the Euclidean dis-
tance. Similar to the pixel-wise CLE metric, the deviation in Smeulders
et al. (2014) computes the average center location errors between the
5.2.2. Offline learning methods predicted center and the ground truth center by:
To improve the discriminative ability of the tracker, most back- ∑ 𝑖 𝑖
𝑖∈𝑆 𝛿(𝑥𝑇 , 𝑥𝐺 ) 𝑥𝑖 , 𝑥𝑖
bone networks in deep trackers are trained in the offline phase, such 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 =1 − , 𝛿(𝑥𝑖𝑇 , 𝑥𝑖𝐺 ) = ‖ 𝑇 𝐺 ‖ (19)
|𝑆| 𝑠𝑖𝑧𝑒(𝑏𝑡𝐺 )
as SiamFC (Bertinetto et al., 2016c) and MDNet (Nam and Han,
2016). SiamFC (Bertinetto et al., 2016c) trained their network on where 𝑥𝑖𝑇 and 𝑥𝑖𝐺 are the centers of 𝑖th predicted and ground truth
ILSVRC (Russakovsky et al., 2015) dataset. In addition to employing bounding boxes in a set of frames, 𝑏𝑡𝐺 denotes the ground truth bound-
large scale datasets, such as Youtube-BB (Real et al., 2017) and ILSVRC ing box of the target in 𝑡th frame. Function 𝛿(𝑥𝑖𝑇 , 𝑥𝑖𝐺 ) is the Euclidean
in SiamRPN (Li et al., 2018c), the data augmentation techniques distance between two centroids, which is normalized by the size of the
(e.g., the flip, rotation, shift, stretching, blur, and dropout in Bhat et al., ground truth bounding box 𝑏𝑡𝐺 . |𝑆| denotes the number of successfully
2018; the distorting colors by adding image brightness, contrast, satu- tracked bounding boxes. The drawback of the above two metrics is
ration; hue in Yang and Chan, 2017) that expand the training datasets that such estimation may fail to reflect the performance of the tracking
also improve the network’s discriminative ability and prevent it from algorithms (Babenko et al., 2011) correctly. Instead, the precision
plots (Babenko et al., 2011; Wu et al., 2013) show the percentage of
over-fitting. In GOTURN (Held et al., 2016), the cropped image pairs
frames in which the estimated target locations are within the threshold
from still images are added to augment the training set and are useful to
distance of the ground truth, and can be defined as:
enhance the discriminative ability of the network. In DaSiamRPN (Zhu
et al., 2018a), the diverse categories of positive pairs collected from 𝑁𝜏
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = (20)
more complex datasets (i.g., ImageNet Detection, Russakovsky et al., 𝑁𝑓 𝑟𝑎𝑚𝑒𝑠
2015 and COCO Detection, Lin et al., 2014) and semantic negative pairs where 𝑁𝜏 denotes the number of successfully tracked frames that the
extracted from both the same categories and different categories are center location error is within a threshold 𝜏 (e.g., 20 pixels). 𝑁𝑓 𝑟𝑎𝑚𝑒𝑠
utilized to improve the tracker’s discriminative ability. denotes the number of total frames in a sequence.
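The center-location-error based precision of Eqs. (18) and (20) is straightforward to compute from the predicted and annotated centers. The following is a minimal NumPy sketch with illustrative argument names; it is not taken from any benchmark toolkit.

```python
import numpy as np

def precision_at(centers_pred, centers_gt, tau=20.0):
    """Fraction of frames whose center location error is within `tau` pixels
    (Eq. (20)); `centers_pred` and `centers_gt` both have shape (N, 2)."""
    cle = np.linalg.norm(np.asarray(centers_pred) - np.asarray(centers_gt), axis=1)
    return float((cle <= tau).mean())
```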


6.2. Success plot 6.5. LSM metric

The precision metric only measures the localization accuracy and The Longest Subsequence Measure (LSM) proposed in TLP (Liu
does not reflect the scale of the target. Furthermore, the success plot et al., 2015) dataset aims to evaluate the long-term tracking per-
represents the percentage of frames for which the overlap between the formance. In detail, LSM computes the percentage of frames of the
estimated bounding box and the ground truth exceeds a threshold. Let longest successfully tracked continuous subsequence among the whole
𝑏𝑡 and 𝑏𝑔 denote the bounding boxes of the tracked target and ground sequence. For a tracking dataset with N sequences in total, the LSM at
truth respectively, we define the intersection-over-union (IoU) between the threshold 𝑥% can be computed as:
𝑏𝑡 and 𝑏𝑔 as:
1 ∑
𝑁
𝑅𝐿𝑆𝑀 (𝑥) = 𝑟 (𝑥) (23)
𝑏𝑡 ∩ 𝑏𝑔 𝑁 𝑖 𝑖
𝐼𝑜𝑈 = (21)
𝑏𝑡 ∪ 𝑏𝑔
where 𝑟𝑖 (𝑥) is the indicator function which got 1 if the ratio of the
where ∩ is the intersection of the two bounding boxes, and ∪ denotes length of the longest successful tracked subsequence no less than 𝑥%
the union of the two bounding boxes. Then the average overlap mea- and 0 otherwise. The parameter 𝑥 varies from 0 to 1. The 𝑅𝐿𝑆𝑀 (𝑥)
sure based on IoU can be defined as the mean of IoU among the whole denotes the ratio of successfully tracked sequences to the whole dataset,
sequence. Therefore, given a threshold 𝐴, the success of a sequence can where a sequence is successfully tracked if 𝑥% frames in it have IoU >
be computed as: 0.5. This metric reflects the ability of a tracker to continuously track
the target in a sequence without failure.
1 ∑
𝑁
𝑁𝐴
𝑅𝐴 = = 1 (𝑠 ) (22)
𝑁𝐺 𝑁 𝑖=1 𝐴 𝑖
6.6. F-score plot
here 𝑠𝑖 is the IoU between tracking result of 𝑖th frame and ground truth
bounding box. 1𝐴 (𝑠𝑖 ) is an indicator function which got 1 if 𝑠𝑖 below a
To take both the precision and recall measures into account, Lukežič
threshold 𝐴 and 0 otherwise.
et al. (2021) proposed F-score defined on precision and recall to evalu-
ate long-term tracking performance. In this metric, precision indicates
6.3. Robustness the accuracy of target absence prediction and recall reflects the target
re-detection capability of the tracker. Given a video sequence and its
ground truth bounding boxes of the target, the F-score can be computed
One pass evaluation (OPE), temporal robustness evaluation (TRE),
as:
and spatial robustness evaluation (SRE) are three evaluation metrics
proposed in Wu et al. (2015). OPE evaluates a tracker throughout a 2𝑃 𝑟(𝜏𝜃 )𝑅𝑒(𝜏𝜃 )
𝐹 (𝜏𝜃 ) = (24)
test sequence given the initialization (i.e., the bounding box of the 𝑃 𝑟(𝜏𝜃 ) + 𝑅𝑒(𝜏𝜃 )
target object) of the first frame, where different initialization states where 𝑃 𝑟(𝜏𝜃 ) and 𝑅𝑒(𝜏𝜃 ) are the precision and recall at the classification
would affect the tracking performance. Besides, TRE and SRE were threshold 𝜏𝜃 , respectively, each of which is defined as:
proposed (Wu et al., 2013) as alternatives to analyze the robustness 1 ∑
of a tracker under different initialization states, including spatial cir- 𝑃 𝑟(𝜏𝜃 ) = 𝛺(𝐴𝑡 (𝜏𝜃 ), 𝐺𝑡 )
𝑁𝑝
𝑡∈{𝑡∶𝐴𝑡 (𝜏𝜃 )≠∅}
cumstances (i.e., starting with different bounding boxes) and temporal (25)
1 ∑
circumstances (i.e., starting at different frames). Furthermore, SRE with 𝑅𝑒(𝜏𝜃 ) = 𝛺(𝐴𝑡 (𝜏𝜃 ), 𝐺𝑡 )
𝑁𝑔
restart (SRER) evaluates whether a tracker is sensitive to the spatial 𝑡∈{𝑡∶𝐺𝑡 ≠∅}
turbulence and OPE with restart (OPER) measures the performance
where 𝐴𝑡 (𝜏𝜃 ) denotes the predicted state at time-step 𝑡 with the 𝜏𝜃 be a
of the tracker with restarting when a tracking failure occurs. In the classification threshold, and 𝜃𝑡 is the predicted classification score. 𝐺𝑡
VOT tracking benchmark, the robustness (R) measures how many times is the corresponding ground truth target state. 𝛺(𝐴𝑡 (𝜏𝜃 ), 𝐺𝑡 ) computes
the tracker loses the target (fails) during tracking, where a failure is the intersection over union between the predicted state 𝐴𝑡 (𝜏𝜃 ) and the
detected when the overlap measure becomes zero. In VOT2015 (Kristan ground truth 𝐺𝑡 . In long-term tracking, the ground truth is empty if the
et al., 2015), both the average overlap and robustness (failures) are target is absent, i.g., 𝐺𝑖 = ∅. Similarly, the predicted is set as empty
considered to obtain the Expected Average Overlap (EAO). if the classification score is below a classification threshold 𝜏𝜃 , i.g.,
𝜃𝑡 < 𝜏𝜃 . 𝑁𝑝 denotes the number of frames where 𝐴𝑡 (𝜏𝜃 ) ≠ ∅. 𝑁𝑔 is the
6.4. Tracking length number of frames with 𝐺𝑡 ≠ ∅.

Tracking length (Čehovin et al., 2016) is another metric that com- 6.7. Tracking speed
putes the number of successfully tracked frames between the initialized
frame and its first failure which can be detected by center or overlap Tracking speed is an important criterion in tracking, especially in
measure given a predefined threshold 𝛾. This criterion may be suffered practical applications, which is expressed as the mean number of frames
from some difficult scenarios, including tracking failure due to sudden that can be processed by a tracker per second. Nevertheless, unlike
target shift, poor initialization due to low resolution, out-of-view, or accuracy and robustness that can be evaluated and compared fairly
heavy occlusion which happened close to the initialization frame. Once across different trackers within the same experiment and dataset, the
a tracking failure is detected, the tracker will stop working, and the rest tracking speed is determined by not only the computation of a tracking
of the sequence will be discarded. So, the failure rate (Čehovin et al., algorithm itself but also some other factors, such as implementation
2016) that measures the mean number of failures per sequence, which platforms (e.g., Windows, Linux, and Mac), programming languages
also was introduced in VOT2016 (Kristan et al., 2016) as a performance (e.g., C, C++, PYTHON, and MATLAB), hardware platforms that are
evaluation metric, can reflect the robustness of the tracker in the entire used to perform the experiments, and the deep learning frameworks
sequence. such as Caffe, Torch, Tensorflow, Julia, MXNet, and MatConvNet.


Table 1 continuous attributes are defined in OxUvA (Gavves et al., 2018)


Annotated sequence attributes in the tracking benchmark.
dataset which can be computed directly from bounding box annotation
Attr Description and meta-data. The continuous attributes include size, relative speed,
IV Illumination variation: The illumination in the target region changes scale change, object absence, distractors, and length. Furthermore,
significantly.
we can convert these continuous attributes to binary attributes via
SV Scale variation: The ratio of the bounding boxes of the first frame
and the current frame is out of range [0.5, 2].
thresholding. For a tracker, overall performance usually reflects the
OCC Occlusion: The target is partially or fully occluded. general capability for tracking. However, the performance on different
DEF Deformation: Non-rigid object deformation. attributes can reflect the strength and weakness of each tracker. Among
MB Motion blur: The target is blurred due to the motion of target or these tracking datasets, each of the large-scale datasets such as GOT-
camera.
10k, LaSOT, and TrackingNet contains more than one thousand video
FM Fast motion: The motion of the target is larger than the specific
number of pixels. sequences in broad and diverse contexts, providing the researchers with
IPR In-plane rotation: The target rotates in the image plane. new conditions to train deep trackers. In addition to the summary of
OPR Out-of-plane rotation: The target rotates out of the image plane. tracking datasets, detailed descriptions of attributes are presented in
OV Out-of-view: The target is partially or fully leaves the camera field
Table 1. For brevity, we merge the attributes that come from different
of view.
BC Background clutters: The background near the target has similar
datasets but with the same meaning. For example, SOB and CON both
appearance as the target. denote the similar objects or confusion. ARC and LSV both denote the
LR Low resolution: At least one ground truth bounding box has less aspect ratio changes. The training and validation subsets of GOT-10k
than threshold pixels (i.e.g 400 pixels in OTB-100 or 1000 pixels in also provide the annotation of visible ratios besides the bounding boxes
LaSOT).
of the target objects.
SHC Shadow change: The shadow on the target changes over time.
FL Flash: The flash on the target changes significantly.
DL Dim light: The light around the target is dim. 8. Experimental results and analysis
POC Partial occlusion: The target is partially occluded.
FOC Full occlusion: The target is fully occluded.
In this section, detailed experiments are performed to evaluate state-
VC Viewpoint change: Viewpoint affects target appearance significantly.
CM Camera motion: Abrupt motion of the camera. of-the-art trackers on OTB, VOT, LaSOT, GOT-10k, and TrackingNet
SOB/CON Similar objects or confusion: There are objects of similar shape or datasets. Then, we do a comparative analysis of the selected trackers.
same type near the target. In addition to the quantitative results, we also provide the qualitative
FBC Fast background change: The background changes quickly. results on the five datasets. We selected the trackers to be evaluated
ROT Rotation: The target rotates in the image.
CS Camera shake: The movement of the camera results in the target
in these benchmarks from different categories such as sparse learning,
lacks clarity. different kind of DCF trackers, different kinds of deep trackers, and
ARC/LSV Aspect ratio changes: The fraction of bounding box aspect ratio is so on. Due to reason of limited room for us to clearly illustrated the
outside the range [0.5, 2]. tracking results on OTB and LaSOT datasets, there are 44 and 36
MC Motion change: The absolute difference between ground truth
trackers shown in two benchmarks, respectively. we reproduce the
center positions in consecutive frames is large.
S/LO Small/large objects: Targets with very small or very large success and precision plots with the raw tracking data provided by
resolutions compared to the average size of a target in a sequence. the original paper or with the released source code that contains the
LI Light: Uneven and unstable light. configuration for reproducing the tracking results. For VOT datasets,
SC Surface cover: The surface cover of the target changes drastically the trackers are evaluated by the official benchmark tools. As shown
over time.
SP Specularity: Specular highlights occlude the real target and bring
in Table 4, we list 38 popular trackers for comparison. There are 35
abrupt changes. trackers shown in Table 5 on the GOT-10k dataset. For the TrackingNet
TR Transparency: Uneven and unstable transparency. dataset, we list 43 trackers for comparison.
MS Motion smoothness: Speed of the target is unstable. Both the GOT-10k and TrackingNet benchmarks provide online
MCO Motion coherence: Parts of target move with different motions.
server for evaluating the tracking performance and the ground truth of
LC Low contrast: The boundaries of the target against the background
are not well defined for distinguishing the target and background. test set is invisible for researchers. We employed some tracking results
ZC Zooming camera: Unstable zooming camera changes scale of the released on the leaderboard of the official website, while some data
whole scene. were token from original paper.
LV Long videos: Length of the sequence between one and two minutes.

uneven: spatial changes, unstable: temporal changes. 8.1. Quantitative evaluation on OTB datasets

OTB-2013 (Wu et al., 2013) is an online object tracking benchmark


7. Tracking datasets which contains 51 annotated targets with ground truth bounding boxes
and 11 different attributes, in which there are two annotated targets in
Many tracking benchmarks have been proposed for measuring the the 𝐽 𝑜𝑔𝑔𝑖𝑛𝑔 sequence. OTB-50 (Wu et al., 2015) is slightly different
different tracking algorithms, including both short-term and long-term with OTB-2013 which consists of 49 sequences with 50 annotated
tracking. In Table 2, we summarize the key statistics among the existing targets, where the 𝑆𝑘𝑎𝑡𝑖𝑛𝑔2 sequence includes two annotated target
tracking datasets such as OTB-100 (Wu et al., 2015), VOT2015 (Kris- objects. Please note that OTB-100 and OTB-2015 (Wu et al., 2015) refer
tan et al., 2015), VOT2016 (Kristan et al., 2016), VOT2018 (Kristan to the same dataset, which is an extension of OTB-50 and contains 98
et al., 2018), TLP (Moudgil and Gandhi, 2017), UAV123 (Mueller fully annotated sequences with 100 target objects.
et al., 2016), ALOV300++ (Smeulders et al., 2014), TColor-128 (Liang We evaluate the trackers by the official toolkit (Wu et al., 0000).
et al., 2015), OxUvA (Gavves et al., 2018), LTB35 (Lukežič et al., The overall performance of trackers is summarized by the success and
2018), GOT-10k (Huang et al., 2018), LaSOT (Fan et al., 2019), precision plots in Fig. 21. The precision plots show the percentage
TrackingNet (Mueller et al., 2018), NfS (Galoogahi et al., 2017a), of frames that the tracking results are within a threshold distance
and NUS-PRO (Li et al., 2016a). These datasets are annotated with from the target and the scores at the threshold of 20 pixels are used
different attributes which we summarized in Table 3. For low resolu- to rank the trackers. The legend of success plots shows the ratios
tion (LR), the lower threshold is 400 pixels for UAV123 (Liu et al., of successful frames when the threshold varies from 0 to 1 and the
2015), TC-128, NfS, and OTB-100 datasets, while is 1000 pixels for area-under-curve (AUC) scores are used to rank the trackers. In most
TrackingNet (Mueller et al., 2018) and LaSOT (Fan et al., 2019). In benchmarks, the average overlap score (AOS) over the sequence is used
addition to binary attributes used in the above datasets, six different to approximate AUC for efficient calculation. Čehovin et al. (2016) and


Table 2
Comparison of most popular tracking benchmarks in the literature.
Dataset Sequences Frames Mean frames Classes Resolution Visual attributes Shot/Long-term
OTB-100 (Wu et al., 2015) 100 59k 598 16 – 11 S
vot2015 (Kristan et al., 2015) 60 21455 357 20 – 11 S
vot2016 (Kristan et al., 2016) 60 21455 357 20 – 5 S
vot2018 (Kristan et al., 2018) 60 21356 356 24 – 5 S
tlp (Moudgil and Gandhi, 2017) 50 676k 13k 17 1280 × 720 6 L
uav123 (Mueller et al., 2016) 123 113k 915 9 – 12 S
alov300++ (Smeulders et al., 2014) 315 8936 483 – – 14 S
tcolor-128 (Liang et al., 2015) 129 55k 431 27 – 11 S
oxuva (Gavves et al., 2018) 366 1.55m 4.2k 22 – 6 L
ltb35 (Lukežič et al., 2018) 35 146k 4k 19 1280 × 720 ∼ 290 × 217 10 L
got-10k (Huang et al., 2018) 10k 1.5m 149 563 – 6 S
lasot (Fan et al., 2019) 1400 3.52m 2506 70 1280 × 720 14 L
trackingnet (Mueller et al., 2018) 30k 14m 471 27 – 15 S
nfs (Galoogahi et al., 2017a) 100 380k 3.8k 17 – 9 L
nUS-PRO (Li et al., 2016a) 365 109k 370 8 1280 × 720 12 S

Fig. 21. The overall performance of precision and success plots on the OTB-2013 (Wu et al., 2013), OTB-50 (Wu et al., 2015), and OTB-100 (Wu et al., 2015) datasets using
one-pass evaluation (OPE).

the supplementary material of Wu et al. (2015) proved that AOS equals However, the DCF trackers can still achieve comparable precision in
to AUC with enough uniformly sampled thresholds. situations like occlusion, out-of-view, and motion blur as shown in
As shown in Fig. 21, the deep trackers and the DCF trackers with Fig. 23.
deep features achieve high performance in terms of accuracy and Although VITAL (Song et al., 2018) achieves high precision of
robustness. The Siamese tracker STMTrack (Fu et al., 2021) shows con- 91.7% on OTB-100, ECO (Danelljan et al., 2017a) achieves much
siderable superiority over others in terms of robustness with the highest better robustness. In Fig. 22, ECO (Danelljan et al., 2017a) exhibits
AUC score of 71.9%. The ECO (Danelljan et al., 2017a) tracker, which
its robustness in most of the challenging scenarios, whereas it is more
employs the feature maps extracted from the VGG-m network (Chatfield
sensitive to the low resolution than SA-SIAM (He et al., 2018) and
et al., 2014), outperforms the ECO-HC (Danelljan et al., 2017a) that
VITAL (Song et al., 2018). Because the high-level deep features are
uses hand-crafted features (e.g., HOG and Color Names) by a large
robust to the appearance variety of objects, SiamR-CNN (Voigtlaender
margin. The adaptive spatial regularized DCF tracker ASRCF (Dai et al.,
et al., 2020) achieves the best robustness.
2019) still outperforms the top Siamese trackers and achieves the
highest precision of 92.2%. Both ECO (Danelljan et al., 2017a) and The overall performance on the OTB-100 dataset also indicates that
CCOT (Danelljan et al., 2016) can achieve competitive precision. We the dataset has become highly saturated over recent years. Both the
attribute this to more accurate and stable filters that are learned in AUC and precision have relatively small gap between the top trackers
the DCF tracker than those in the deep Siamese tracker. In Fig. 22, such as STMTrack (Fu et al., 2021), SiamGAT (Guo et al., 2021),
the robustness of state-of-the-art deep Siamese trackers (e.g., SiamBAN SiamR-CNN (Voigtlaender et al., 2020), and TrDiMP (Danelljan et al.,
and SiamR-CNN) surpasses others in a number of challenging scenarios. 2020).

Table 3
Attributes of the benchmarks.
Dataset IV SV OCC DEF MB FM IPR OPR OV BC LR FOC VC CM POC ARC/LSV SOB/CON FBC ROT CS FL DL SHC MC S/LO SC SP TR MS MCO LC ZC LV LI
OTB-100 (Wu et al., 2015) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
VOT2015 (Kristan et al., 2015) ✓ ✓ ✓ ✓ ✓
VOT2016 (Kristan et al., 2016) ✓ ✓ ✓ ✓ ✓
VOT2018 (Kristan et al., 2018) ✓ ✓ ✓ ✓ ✓
TLP (Moudgil and Gandhi, 2017) ✓ ✓ ✓ ✓ ✓ ✓ ✓

UAV123 (Mueller et al., 2016) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓


ALOV300++ (Smeulders et al., 2014) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
TColor-128 (Liang et al., 2015) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
LTB35 (Lukežič et al., 2018) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
GOT-10k (Huang et al., 2018) ✓ ✓ ✓ ✓ ✓ ✓
LaSOT (Fan et al., 2019) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
TrackingNet (Mueller et al., 2018) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
NfS (Galoogahi et al., 2017a) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
NUS-PRO (Li et al., 2016a) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓



F. Chen, X. Wang, Y. Zhao et al. Computer Vision and Image Understanding 222 (2022) 103508

Fig. 22. The success plots of each attribute on OTB-100 (Wu et al., 2015), including background clutter, deformation, illumination variation, in-plane rotation, low resolution, out
of view, motion blur, occlusion, out-of-plane rotation, scale variation, and fast motion.
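The per-attribute success curves shown above are built from per-frame overlaps. As a reference, the following is a minimal NumPy sketch of the overlap-based success rate described in Section 6.2 (Eqs. (21)–(22)), counting frames whose overlap exceeds each threshold; box format and function names are illustrative assumptions rather than the official OTB toolkit code.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given in (x, y, w, h) format (Eq. (21))."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (aw * ah + bw * bh - inter)

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success rate at each overlap threshold; averaging the curve over the
    thresholds approximates the area under the curve (AUC) used for ranking."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(overlaps > t).mean() for t in thresholds])
```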

8.2. Quantitative evaluation on VOT datasets more sequences with non-rigid object deformation. In Table 4, the deep
trackers like SiamAttn (Yu et al., 2020) and OceanOn (Zhang et al.,
In this section, we evaluate state-of-the-art trackers on three VOT 2020) achieve high performance compared to DCF trackers. The high
datasets by vot-toolkit (Kristan et al., 2017). The results of three metrics performance benefits from deeper backbone networks and discrimi-
that including Expected Average Overlap (EAO), Accuracy Value (Av),
native detection module, for example, the multi-level RPN modules
and Robustness Value (Rv) are shown in Table 4. Each of the VOT
in SiamAttn (Yu et al., 2020), the bounding box adaptive network
datasets contains 60 sequences. VOT2015 and VOT2016 contain the
in SiamBAN (Chen et al., 2020), and the segmentation prediction
same sequences, where the main difference is that the ground truth
bounding boxes in VOT2016 are more accurate than those in VOT2015. network in D3S (Lukežič et al., 2020). Ocean significantly outperforms
In VOT2018, the least challenging sequences from VOT2016 are re- SiamRPN++ with a large gain of 5.2% in EAO due to the object-
placed by new sequences. EAO is introduced in VOT2015 to unify aware features that combined with the regular-region features from
the accuracy and robustness of the trackers. The VOT datasets contain convolution improve the reliability of the classification network.


Fig. 23. The precision plots of each attribute on OTB-100 (Wu et al., 2015), including background clutter, deformation, illumination variation, in-plane rotation, low resolution,
out of view, motion blur, occlusion, out-of-plane rotation, scale variation, and fast motion.

The conventional trackers remain a large gap to recently deep track- other trackers by a large margin. Among DCF trackers, DRT (Sun et al.,
ers, especially on videos with severe appearance deformation. Among 2018a), which combines multiple deep features (i.g., conv1 from VGG-
the DCF trackers, DeepSTRCF (Li et al., 2018a), CSR-DCF (Lukežič m, conv4-3 from VGG-16, Chatfield et al., 2014) and hand-crafted
et al., 2017), and DeepSRDCF (Danelljan et al., 2015a) mainly fo- features (i.g., HOG and Color Names) to represent target object and
cus on resolving the boundary effects with different regularization introduces the base filter and reliability learning mechanism, achieves
methods. DeepSTRCF (Li et al., 2018a) surpasses its counterparts Deep- the best EAO and robustness scores on VOT2018. The robustness scores
SRDCF (Danelljan et al., 2015a) and CSR-DCF (Lukežič et al., 2017)
illustrate the effectiveness of the basis filter introduced in DRT (Sun
by 3.2% and 3.1% in terms of accuracy on VOT2018. In terms of
et al., 2018a) in learning the discriminative and reliable representation
robustness, DeepSTRCF reduces the failure rate by 14.1% and 49.2%
of the target.
compared to CSR-DCF (Lukežič et al., 2017) and DeepSRDCF (Danelljan
et al., 2015a) on VOT2018, respectively. It illustrates that both the However, the high dimensional deep features used in ECO (Danell-
spatial and temporal regularizations are beneficial in improving the jan et al., 2017a), CFCF (Gundogdu and Alatan, 2018), and DRT (Sun
accuracy and robustness of trackers. et al., 2018a) led to inferior tracking speed than SiamRPN (Li et al.,
Among deep trackers, the accuracy of SiamRPN++ (Li et al., 2019b) 2018c) and DaSiamRPN (Zhu et al., 2018a), as we reported in Table 7.
outperforms SiamRPN (Li et al., 2018c) and DaSiamRPN (Zhu et al., The diverse categories of positive and semantic negative pairs in DaSi-
2018a) by 11% and 3% on VOT2018, respectively, and outperforms amRPN (Zhu et al., 2018a) enhance the tracker’s discriminative ability


Table 4
Performance comparison among the state-of-the-art trackers on VOT 2015, VOT 2016, and VOT 2018. The results are presented in terms of EAO, Av and Rv.
The top three results are marked in red, blue, and green fonts, respectively.
VOT2015 VOT2016 VOT2018
Trackers Av Rv EAO Av Rv EAO Av Rv EAO
SiamAttn (Yu et al., 2020) – – – 0.680 0.140 0.537 0.630 0.160 0.470
STMTrack (Fu et al., 2021) – – – 0.629 0.149 0.468 0.590 0.159 0.447
SiamRN (Cheng et al., 2021) – – – – – – 0.595 0.131 0.470
TrDiMP (Wang et al., 2021b) 0.666 0.121 0.490 0.617 0.131 0.480 0.600 0.141 0.462
OceanOn (Zhang et al., 2020) – – – – – – 0.592 0.117 0.489
Ocean (Zhang et al., 2020) – – – – – – 0.598 0.169 0.467
SiamRPN++ (Li et al., 2019b) 0.654 0.201 0.464 0.637 0.178 0.478 0.600 0.234 0.415
SiamR-CNN (Voigtlaender et al., 2020) 0.676 0.201 0.451 0.645 0.173 0.461 0.612 0.220 0.405
SiamBAN (Chen et al., 2020) 0.655 0.136 0.495 0.632 0.150 0.505 0.590 0.178 0.447
PrDiMP50 (Danelljan et al., 2020) 0.680 0.140 0.489 0.652 0.140 0.476 0.618 0.165 0.442
D3S (Lukežič et al., 2020) 0.605 0.150 0.421 0.611 0.131 0.458 0.591 0.159 0.459
SiamFC++ (Xu et al., 2020) – – – – – – 0.587 0.183 0.426
SPM-Tracker (Wang et al., 2019a) – – – 0.62 0.21 0.434 0.58 0.3 0.338
ATOM (Danelljan et al., 2019) 0.641 0.185 0.434 0.617 0.190 0.424 0.590 0.201 0.401
DiMP50 (Bhat et al., 2019) 0.643 0.159 0.452 0.624 0.136 0.479 0.597 0.152 0.44
ASRCF (Dai et al., 2019) 0.581 0.286 0.318 0.568 0.187 0.390 0.492 0.234 0.328
DeepSTRCF (Li et al., 2018a) 0.591 0.295 0.321 0.569 0.248 0.341 0.523 0.215 0.345
SA-SIAM (He et al., 2018) 0.594 0.342 0.313 0.543 0.337 0.291 0.5 0.459 0.236
DRT (Sun et al., 2018a) 0.556 0.169 0.389 0.557 0.173 0.390 0.519 0.201 0.356
CSR-DCF (Lukežič et al., 2017) 0.563 0.262 0.320 0.524 0.239 0.338 0.491 0.356 0.256
CFCF (Gundogdu and Alatan, 2018) 0.578 0.239 0.336 0.560 0.169 0.384 0.511 0.286 0.283
CCOT (Danelljan et al., 2016) 0.544 0.243 0.303 0.541 0.239 0.331 0.494 0.318 0.267
SiamRPN (Li et al., 2018c) 0.604 0.262 0.358 0.578 0.314 0.340 0.490 0.464 0.244
DaSiamRPN (Zhu et al., 2018a) 0.639 0.183 0.446 0.609 0.225 0.401 0.570 0.337 0.326
ECO (Danelljan et al., 2017a) 0.570 0.310 0.314 0.555 0.201 0.374 0.484 0.276 0.281
ECO_HC (Danelljan et al., 2017a) 0.563 0.361 0.280 0.542 0.304 0.322 0.494 0.435 0.238
DSST (Danelljan et al., 2017b) 0.549 0.763 0.172 0.535 0.707 0.181 0.395 1.452 0.079
LSART (Sun et al., 2018b) 0.580 0.196 0.371 0.495 0.215 0.323 0.495 0.276 0.323
MEEM (Zhang et al., 2014c) 0.503 0.501 0.221 0.490 0.515 0.194 0.463 0.534 0.193
Staple (Bertinetto et al., 2016b) 0.580 0.375 0.300 0.547 0.379 0.295 0.530 0.688 0.169
SRDCF (Danelljan et al., 2015b) 0.564 0.332 0.288 0.536 0.421 0.247 0.490 0.974 0.119
DeepSRDCF (Danelljan et al., 2015a) 0.568 0.281 0.318 0.507 0.326 0.276 0.492 0.707 0.154
SiamFC (Bertinetto et al., 2016c) 0.530 0.880 0.290 0.532 0.461 0.88 0.503 0.585 0.188
SAMF (Li and Zhu, 2014) 0.532 0.585 0.202 0.507 0.590 0.186 0.484 1.302 0.093
KCF (Henriques et al., 2015) 0.486 0.670 0.167 0.491 0.571 0.192 0.447 0.773 0.135
DSiam (Guo et al., 2017) 0.541 0.280 – – – – 0.512 0.646 0.196
IVT (Ross et al., 2008) 0.444 1.152 0.122 0.420 1.114 0.115 0.400 1.638 0.075
MIL (Babenko et al., 2011) 0.423 0.735 0.171 0.408 0.726 0.115 0.165 1.011 0.118
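As a rough guide to how the Av and Rv columns in Table 4 relate to raw tracking output, the sketch below follows the common VOT convention that accuracy is the mean overlap over successfully tracked frames and robustness is derived from the number of tracking failures. It is an illustrative simplification, not the official VOT toolkit; the failure criterion (zero overlap after re-initialization) and the helper name are our own assumptions.

```python
import numpy as np

def vot_accuracy_robustness(overlaps_per_sequence):
    """overlaps_per_sequence: list of 1-D arrays of per-frame IoUs (0 where the tracker failed)."""
    valid_overlaps, total_failures = [], 0
    for overlaps in overlaps_per_sequence:
        overlaps = np.asarray(overlaps, dtype=float)
        total_failures += int((overlaps == 0).sum())   # frames where tracking failed and was re-initialized
        valid_overlaps.append(overlaps[overlaps > 0])  # only successfully tracked frames count for accuracy
    accuracy = float(np.concatenate(valid_overlaps).mean())            # Av: mean overlap on tracked frames
    robustness = total_failures / max(len(overlaps_per_sequence), 1)   # Rv: average number of failures per sequence
    return accuracy, robustness
```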

Fig. 24. Illustrations of the discriminative ability from DaSiamRPN (Zhu et al., 2018a), in which the target and distractors are denoted by the red and blue bounding boxes, respectively. We can see that the influence of the distractor can be effectively suppressed by the DaSiamRPN tracker in subfigure (b) during tracking. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

8.3. Quantitative evaluation on LaSOT dataset

In this section, detailed results on the LaSOT (Fan et al., 2019) test set are provided and evaluated by the official toolkit (Fan et al., 0000). The LaSOT dataset contains 1400 sequences, and the test set has 280 sequences with 3.52 million frames. This benchmark classifies existing tracking datasets into short-term tracking and long-term tracking by the average sequence length. In comparison with most short-term tracking benchmarks, whose average sequence length is less than 600 frames, the LaSOT dataset focuses on long-term tracking, in which the shortest sequence contains at least 1000 frames and the average sequence length is 2512 frames. The whole dataset consists of 70 object categories and each category consists of 20 sequences of targets. In addition, LaSOT categorizes the sequences according to 14 attributes, including aspect ratio change (ARC), background clutter (BC), camera motion (CM), deformation (DEF), fast motion (FM), full occlusion (FOC), illumination variation (IV), low resolution (LR), motion blur (MB), out-of-view (OV), partial occlusion (POC), rotation (ROT), scale variation (SV), and viewpoint change (VC). In this dataset, they also provide natural language specifications for each sequence.

To report the quantitative results on LaSOT, we select the representative state-of-the-art trackers, including STMTrack (Fu et al., 2021), SiamRN (Cheng et al., 2021), TrDiMP (Wang et al., 2021b), OceanOn (Zhang et al., 2020), Ocean (Zhang et al., 2020), SPM-Tracker (Wang et al., 2019a), PrDiMP50 (Danelljan et al., 2020), DiMP50 (Bhat et al., 2019), SiamBAN (Chen et al., 2020), D3S (Lukežič et al., 2020), SiamRPN++ (Li et al., 2019b), MDNet (Nam and Han, 2016), VITAL (Song et al., 2018), SiamFC (Bertinetto et al., 2016c), DSiam (Guo et al., 2017), ECO (Danelljan et al., 2017a), STRCF (Li et al., 2018a), SINT (Tao et al., 2016), ASRCF (Dai et al., 2019), ECO_HC (Danelljan et al., 2017a), CFNet (Valmadre et al., 2017), PTAV (Fan and Ling, 2017a), Staple (Bertinetto et al., 2016b), BACF (Galoogahi et al., 2017b), TRACA (Choi et al., 2018), CSR-DCF (Lukežič et al., 2017), SRDCF (Danelljan et al., 2015b), SAMF (Li and Zhu, 2014), fDSST (Danelljan et al., 2017b), MIL (Babenko et al., 2011), and IVT (Ross et al., 2008).

Fig. 25 shows the overall precision and success plots over all 280 test videos. Figs. 26 and 27 provide the detailed plots for different attributes. We also provide informative qualitative tracking examples of several representative trackers in Fig. 28.
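To make the evaluation protocol concrete, the sketch below shows one way to compute the precision and success statistics used in these plots from per-frame predicted and ground-truth boxes. It is a minimal illustration under the usual OPE definitions (center-location-error threshold of 20 pixels for precision, overlap thresholds for success), not the official LaSOT toolkit; the (x, y, w, h) box format and helper names are our own assumptions.

```python
import numpy as np

def iou(a, b):
    # a, b: arrays of boxes in (x, y, w, h) format, shape (N, 4)
    ax2, ay2 = a[:, 0] + a[:, 2], a[:, 1] + a[:, 3]
    bx2, by2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
    iw = np.clip(np.minimum(ax2, bx2) - np.maximum(a[:, 0], b[:, 0]), 0, None)
    ih = np.clip(np.minimum(ay2, by2) - np.maximum(a[:, 1], b[:, 1]), 0, None)
    inter = iw * ih
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def center_error(a, b):
    ca = a[:, :2] + a[:, 2:] / 2.0
    cb = b[:, :2] + b[:, 2:] / 2.0
    return np.linalg.norm(ca - cb, axis=1)

def ope_curves(pred, gt):
    # Success plot: fraction of frames whose IoU exceeds each threshold in [0, 1].
    # Precision plot: fraction of frames whose center error is below each pixel threshold.
    overlaps = iou(pred, gt)
    errors = center_error(pred, gt)
    success = [(overlaps > t).mean() for t in np.linspace(0, 1, 21)]
    precision = [(errors <= t).mean() for t in np.arange(0, 51)]
    auc = float(np.mean(success))      # area under the success curve
    prec20 = float(precision[20])      # precision at the conventional 20-pixel threshold
    return success, precision, auc, prec20
```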


Fig. 25. The overall performance of precision and success plots on the LaSOT dataset using one-pass evaluation (OPE).

Different from the OTB and VOT datasets, the LaSOT dataset contains more challenging sequences and all sequences are long-term. As shown in Fig. 25, the latest deep trackers outperform the conventional trackers by a large margin; in particular, SiamR-CNN (Voigtlaender et al., 2020) and TrDiMP (Wang et al., 2021b) perform well against other trackers in terms of precision and success. SiamR-CNN tracks both the target object and potential similar-looking distractors by dynamically updating a set of tracklets, which are short sequences of detections that belong to the same object. Meanwhile, SiamR-CNN performs global searching and improves the possibility of re-detecting a target after its disappearance.

Compared to DiMP50, PrDiMP50 explicitly models the label uncertainty within a probabilistic regression framework. The training process integrates the information of the uncertainty p(y|y_i) in the training sample (x_i, y_i), where x_i is the input frame and y_i is the target position. By predicting the conditional probability density of the target state for a test image, this approach improves the accuracy of target localization and bounding box regression, as shown by the precision and success plots in Fig. 25.

Trackers like SiamFC (Bertinetto et al., 2016c), SiamRPN++ (Li et al., 2019b), SiamBAN (Chen et al., 2020), SiamCAR (Guo et al., 2020), and Ocean (Zhang et al., 2020) apply a cross-correlation operation to obtain the response map. STMTrack (Fu et al., 2021), TrSiam (Wang et al., 2021b), and TrDiMP (Wang et al., 2021b) take advantage of multiple historical template features to improve the tracking success. For example, TrDiMP (Wang et al., 2021b) outperforms the baseline tracker DiMP50 (Bhat et al., 2019) and improves the AUC score by 7.7%. Similarly, STMTrack (Fu et al., 2021) concatenates multiple historical template features to form the space–time cues. In addition to investigating temporal information along consecutive frames, learning the target model online also benefits the tracking accuracy, as in Ocean (Zhang et al., 2020) and its online version OceanOn (Zhang et al., 2020).

However, the overall performances on LaSOT are severely degraded due to a large number of non-rigid target objects and more challenging scenarios, compared with the OTB and VOT datasets. For example, most of the deep trackers have lost the target in the goldfish-3 sequence in Fig. 28 due to heavy occlusion, distractors, and deformation. This large dataset poses another challenge and implies that there is still large room for improvement in visual tracking.

8.4. Quantitative evaluation on GOT-10k dataset

GOT-10k (Huang et al., 2018) is a large-scale tracking dataset with high-diversity object classes. Compared with LaSOT (Fan et al., 2019) and TrackingNet (Mueller et al., 2018), which have 70 and 21 object classes respectively, GOT-10k contains 563 object classes in total. This dataset contains 10k sequences for training and 180 for testing. In addition to the object classes (e.g., person, bird, bear, and car) labeled in the OTB-100, LaSOT, VOT, and TrackingNet datasets, the videos in GOT-10k are also labeled with motion classes (e.g., diving, skating, playing, and jumping). The object classes in the training and test sets are non-overlapping, with the person class as an exception. Meanwhile, the motion classes of persons between the training and test sets are non-overlapping. The Average Overlap (AO) metric computes the average of overlaps between the ground truth and predicted bounding boxes of all frames in the test set. The Success Rate (SR) denotes the percentage of successfully tracked frames over a sequence where the overlaps are larger than a certain threshold (e.g., 0.5 or 0.75). Please note that the SR metric is also used in the OTB-100 (Wu et al., 2015) dataset. Fig. 29 shows the success plots for state-of-the-art tracking methods, in which the trackers are ranked by average overlap (AO) scores. We also report the AO and SR at overlap thresholds 0.5 and 0.75 in Table 5. As shown in Table 5, the top trackers are based on deep learning and achieve superior performance compared with others. In addition, TransT (Chen et al., 2021), TrDiMP (Wang et al., 2021b), and TrSiam (Wang et al., 2021b) share a similar spirit with the recent Transformer approaches in visual tasks (Parmar et al., 2018; Carion et al., 2020). More specifically, the average overlap score of the top-performing TransT outperforms the second tracker by 5.2%. For the success rate at the threshold of 0.75, the top-performing TransT outperforms the second tracker by 9.9%. This verifies that the Transformer module can further improve the capacity of deep trackers.

The early deep trackers including GOTURN (Held et al., 2016), MDNet (Nam and Han, 2016), and SiamFCv2 (Valmadre et al., 2017) were retrained and tested on GOT-10k. MDNet fixes the convolutional layers while updating the fully-connected layers during tracking. In addition, GOTURN does not fine-tune the convolutional layers during the pretraining process. GOTURN generalizes better than MDNet on the three metrics. We attribute one reason for this improvement to the bounding box regressor that GOTURN trains directly offline, while MDNet only initializes its bounding box regressor in the first image as in Girshick et al. (2014), which struggles to deal with the diverse objects in the GOT-10k dataset.

Both SiamFC (Bertinetto et al., 2016c) and SiamRPN (Li et al., 2018c) employ AlexNet (Krizhevsky et al., 2012) as the backbone network, except that the latter removes the filter groups in conv2, conv4, and conv5. The limitation of SiamFC (Bertinetto et al., 2016c) is that it searches the target in the response map produced from a cross-correlation layer and estimates the scale in a discrete scale space.
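As a concrete illustration of the cross-correlation operation used by these Siamese trackers, the sketch below implements a depth-wise cross-correlation between a template feature and a search-region feature with PyTorch, in the spirit of SiamRPN++-style heads. The shapes and variable names are illustrative assumptions, not code from any of the cited trackers.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation.

    search:   (B, C, Hs, Ws) feature map of the search region
    template: (B, C, Ht, Wt) feature map of the exemplar/template
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) response map, one channel per feature channel
    """
    b, c, hs, ws = search.shape
    # Fold the batch into the channel dimension and use grouped convolution so that
    # each (sample, channel) pair is correlated independently.
    x = search.reshape(1, b * c, hs, ws)
    kernel = template.reshape(b * c, 1, template.shape[2], template.shape[3])
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

# Toy usage: a 31x31 search feature correlated with a 7x7 template feature.
search = torch.randn(2, 256, 31, 31)
template = torch.randn(2, 256, 7, 7)
response = depthwise_xcorr(search, template)  # -> (2, 256, 25, 25)
```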


Fig. 26. The success plots on LaSOT (Fan et al., 2019) dataset for each attribute, including aspect ratio change, background clutter, camera motion, deformation, fast motion, full
occlusion, illumination variation, low resolution, motion blur, out-of-view, partial occlusion, rotation, scale variation, and viewpoint change.

However, SiamRPN (Li et al., 2018c) uses a Region Proposal Network (RPN) which combines a classification branch for target location and a bounding box regression branch for refining the proposals. These two separate branches lead to a significant improvement over SiamFCv2 (Valmadre et al., 2017), with a gain of 8.9% in the AO metric. SiamBAN (Chen et al., 2020), Ocean (Zhang et al., 2020), and SiamFC++-GoogLeNet (Xu et al., 2020) employ anchor-free bounding box regressors to adapt to scale and aspect ratio changes. In addition, Ocean (Zhang et al., 2020) proved that the features transformed from the object region can contribute to the discriminative ability of the classification branch.

Inspired by advanced object detectors, centerness (Xu et al., 2020) and IoU (Jiang et al., 2018) are two ways to evaluate the quality of the proposals and can be combined with the classification branch for selecting the best proposal.
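The AO and SR numbers quoted here follow the GOT-10k protocol summarized above. A minimal way to compute them from per-frame overlaps is sketched below; this is our own helper, not the official GOT-10k toolkit, which additionally averages over sequences and classes.

```python
import numpy as np

def average_overlap_and_success(overlaps_per_sequence, threshold=0.5):
    """Compute a simplified AO and SR over a set of sequences.

    overlaps_per_sequence: list of 1-D arrays, one array of per-frame IoUs per sequence.
    AO here is the mean IoU over all frames; SR@t is the fraction of frames whose IoU
    exceeds the threshold t (0.5 and 0.75 are the thresholds reported on GOT-10k).
    """
    all_overlaps = np.concatenate([np.asarray(o, dtype=float) for o in overlaps_per_sequence])
    ao = float(all_overlaps.mean())
    sr = float((all_overlaps > threshold).mean())
    return ao, sr

# Example with two toy sequences of per-frame IoUs.
seq_a = [0.81, 0.77, 0.60, 0.0]
seq_b = [0.90, 0.88]
ao, sr50 = average_overlap_and_success([seq_a, seq_b], threshold=0.5)
_, sr75 = average_overlap_and_success([seq_a, seq_b], threshold=0.75)
```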


Fig. 27. The precision plots on LaSOT (Fan et al., 2019) dataset for each attribute, including aspect ratio change, background clutter, camera motion, deformation, fast motion,
full occlusion, illumination variation, low resolution, motion blur, out-of-view, partial occlusion, rotation, scale variation, and viewpoint change.

The centerness feature map can be used to reweight the quality of the proposals and be combined with the classification branch in SiamRPN (Li et al., 2018c). SiamGAT (Guo et al., 2021) and SiamCAR (Guo et al., 2020) apply identical head networks that consist of a classification branch and a bounding box regression branch, except for the centerness branch used in SiamCAR. However, SiamGAT uses a target-aware template and computes the correlation scores between the target template and the search region within a complete bipartite graph, improving the accuracy of target localization when the shape and pose of the target change drastically. Specifically, it projects the bounding box onto the template feature map to select a region of interest that is more accurate than the fixed region in SiamRPN.
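As an illustration of the centerness score mentioned above, the snippet below computes an FCOS-style centerness target from the (left, top, right, bottom) regression offsets and uses it to reweight classification scores; this mirrors the general idea behind SiamCAR-like heads, with our own variable names and toy data.

```python
import numpy as np

def centerness(l, t, r, b, eps=1e-12):
    """FCOS-style centerness in [0, 1] from the distances of a location to the four box sides."""
    lr = np.minimum(l, r) / np.maximum(np.maximum(l, r), eps)
    tb = np.minimum(t, b) / np.maximum(np.maximum(t, b), eps)
    return np.sqrt(lr * tb)

# Reweight a classification score map with the centerness map so that
# proposals far from the object center are suppressed.
cls_score = np.random.rand(25, 25)
l, t = np.random.rand(25, 25), np.random.rand(25, 25)
r, b = np.random.rand(25, 25), np.random.rand(25, 25)
quality = cls_score * centerness(l, t, r, b)
best_idx = np.unravel_index(quality.argmax(), quality.shape)  # location of the best proposal
```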


Fig. 28. Qualitative results of the challenging sequences in LaSOT. From top to bottom, the sequences are airplane-15, bear-4, bicycle-18, bus-2, gametarget-2, giraffe-2, goldfish-3, motorcycle-1, kite-6, and surfboard-5.

8.5. Quantitative evaluation on TrackingNet dataset

TrackingNet (Mueller et al., 2018) contains more than 30k videos with more than 14 million annotated frames in total. This dataset has been split into a training set with 30,132 training videos from Youtube-BoundingBoxes (YT-BB) (Real et al., 2017) and a test set with 511 testing videos from YouTube with a Creative Commons license (YT-CC). The test set was selected with a similar distribution to the training set. The whole dataset contains 14,431,266 frames. The annotations for the test set are not visible to researchers except for the initial frame of each sequence, and all the trackers are evaluated through an online server in terms of success, precision, and normalized precision. The success and precision are similar to the OTB-100 dataset. The normalized precision is obtained by normalizing the original precision over the size of the ground truth bounding box. In addition, more than 30% of the sequences in OTB-100 (Wu et al., 2015) are annotated with bounding boxes that have a constant aspect ratio, and such annotations in fact could not reflect the exact target region, whereas the ground truth boxes of TrackingNet capture the target region as much as possible.

We report the detailed comparison results of state-of-the-art trackers on TrackingNet in Table 6. As shown in Table 6, the TransT (Chen et al., 2021) tracker yields the highest success score of 82.06%, precision score of 80.64%, and normalized precision score of 87.09%. Compared with SiamR-CNN (Voigtlaender et al., 2020), which uses a re-detection scheme, TransT improves the performance by 0.86% in success, 0.64% in precision, and 1.69% in normalized precision, respectively. The recent deep trackers illustrate stronger generalization ability to target variations than the previous trackers. Note that SiamRPN++ (Li et al., 2019b) outperforms D3S (Lukežič et al., 2020) by 0.5%. This improvement of SiamRPN++ is partially due to its pretraining on the training set of TrackingNet. Benefiting from the online target model adaptation with the steepest descent (Wright et al., 1999) method and the discriminative learning loss, DiMP50 obtains a better AUC score of 74.0% compared to SiamRPN++. However, SiamRPN++ obtains a gain of 1% in AUC with multiple features of a deeper backbone network (i.e., ResNet50).
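For clarity, the normalized precision used by TrackingNet scales the center error by the size of the ground-truth box before thresholding. A minimal sketch of this computation is given below, assuming (x, y, w, h) box arrays; the helper name and the threshold range are our own choices rather than the official evaluation server code.

```python
import numpy as np

def normalized_precision_curve(pred, gt, thresholds=np.linspace(0, 0.5, 51)):
    """pred, gt: arrays of boxes in (x, y, w, h) format, shape (N, 4)."""
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    # Normalize the per-axis center error by the ground-truth width and height.
    norm_err = (pred_centers - gt_centers) / np.maximum(gt[:, 2:], 1e-12)
    dist = np.linalg.norm(norm_err, axis=1)
    return np.array([(dist <= t).mean() for t in thresholds])
```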


Table 6
Comparison results on TrackingNet dataset with performance measures of success,
precision, and normalized precision (NormPrec). The top three results are marked in
red, blue, and green fonts, respectively. SiamFC𝑜𝑟𝑖 and SiamFC𝑓 𝑡 denote the original
model and the model fine-tuned on TrackingNet, respectively. LightTrack-LargeA and
LightTrack-LargeB are two searched trackers under different resource constraints. Please
refer to Yan et al. (2021) for the details.
Trackers Success ↑ Precision ↑ NormPre ↑
TransT (Chen et al., 2021) 82.1 80.7 87.1
SiamR-CNN (Voigtlaender et al., 2020) 81.2 80.0 85.4
STMTrack (Fu et al., 2021) 80.3 76.7 85.1
TrDiMP (Wang et al., 2021b) 78.4 73.1 83.3
TrSiam (Wang et al., 2021b) 78.1 72.7 82.9
SiamGAT (Guo et al., 2021) 75.3 69.7 80.7
LightTrack-LargeA (Yan et al., 2021) 73.6 70.0 78.8
LightTrack-LargeB (Yan et al., 2021) 73.3 70.8 78.9
SiamCAR (Guo et al., 2020) 74.0 68.4 80.4
FCOS-MAML (Wang et al., 2020) 67.7 62.5 75.2
Retain-MAML (Wang et al., 2020) 75.7 – 82.2
CGACD (Du et al., 2020) 71.1 69.3 80.0
SiamAttn (Yu et al., 2020) 75.2 – 81.7
D3S (Lukežič et al., 2020) 72.8 66.4 76.8
KYS (Bhat et al., 2020) 74.0 68.8 80.0
Fig. 29. The overall performance of success rate plots on the GOT-10k dataset using PrDiMP50 (Danelljan et al., 2020) 75.8 70.4 81.6
one-pass evaluation (OPE). PrDiMP18 (Danelljan et al., 2020) 75.0 69.1 80.3
SiamFC++-GooLeNet (Xu et al., 2020) 75.4 70.5 80.0
Table 5 SiamFC++-AlexNet (Xu et al., 2020) 71.2 75.8 64.6
Comparison results on GOT-10k dataset with performance measures of average overlap SiamBAN (Chen et al., 2020) 72.3 68.5 79.4
(AO) and success rate (SR). The SR0.5 denotes the success rates where the overlaps DCFST-50 (Zheng et al., 2020) 75.2 70.0 80.9
exceed 50%, and SR0.75 measures the success rates over the frames where the overlaps DiMP50 (Bhat et al., 2019) 74.0 68.7 80.1
exceed 0.75%. The top three results are marked in red, blue, and green fonts, DiMP18 (Bhat et al., 2019) 72.3 66.6 78.5
respectively. SiamRPN++ (Li et al., 2019b) 73.3 69.4 80.0
SPM-Tracker (Wang et al., 2019a) 71.2 66.1 77.8
Trackers AO ↑ SR0.5 ↑ SR0.75 ↑
ATOM (Danelljan et al., 2019) 70.3 64.8 77.1
TransT (Chen et al., 2021) 0.723 0.824 0.682 C-RPN (Fan and Ling, 2019) 66.9 61.9 74.6
TrDiMP (Wang et al., 2021b) 0.671 0.777 0.583 DaSiamRPN (Zhu et al., 2018a) 63.8 59.1 73.3
TrSiam (Wang et al., 2021b) 0.660 0.766 0.571 UPDT (Bhat et al., 2018) 61.1 55.7 70.2
SiamR-CNN (Voigtlaender et al., 2020) 0.649 0.728 0.597 MDNet (Nam and Han, 2016) 60.6 56.5 70.5
STMTrack (Fu et al., 2021) 0.642 0.737 0.579 SiamFC𝑜𝑟𝑖 (Bertinetto et al., 2016c) 57.1 53.3 66.3
KYS (Bhat et al., 2020) 0.636 0.751 0.515 SiamFC𝑓 𝑡 (Bertinetto et al., 2016c) 58.1 54.3 67.3
PrDiMP50 (Danelljan et al., 2020) 0.634 0.738 0.543 CFNet (Valmadre et al., 2017) 57.8 53.3 65.4
SiamGAT (Guo et al., 2021) 0.627 0.743 0.488 CSR-DCF (Lukežič et al., 2017) 53.4 48.0 62.2
LightTrack-LargeB (Yan et al., 2021) 0.623 0.726 – Staple (Bertinetto et al., 2016b) 52.8 47.0 60.5
DiMP50 (Bhat et al., 2019) 0.611 0.717 0.492 Staple𝐶𝐴 (Mueller et al., 2017) 52.9 46.8 60.5
OceanOn (Zhang et al., 2020) 0.611 0.721 0.473 ECO (Danelljan et al., 2017a) 55.4 49.2 61.8
SiamFC++-GoogLeNet (Xu et al., 2020) 0.607 0.737 0.464 ECO𝐻𝐶 (Danelljan et al., 2017a) 54.1 60.8 47.6
D3S (Lukežič et al., 2020) 0.597 0.676 0.462 BACF (Galoogahi et al., 2017b) 52.3 46.1 58.0
Ocean (Zhang et al., 2020) 0.592 0.695 0.465 SRDCF (Danelljan et al., 2015b) 52.1 45.5 57.3
SiamCAR (Guo et al., 2020) 0.579 0.677 0.436 KCF (Henriques et al., 2015) 44.7 41.9 54.6
SiamBAN (Chen et al., 2020) 0.549 0.651 0.404 DSST (Danelljan et al., 2014) 46.4 46.0 58.8
ATOM (Danelljan et al., 2019) 0.556 0.634 0.402 Struck (Hare et al., 2016) 45.6 40.2 53.9
SiamRPN++ (Li et al., 2019b) 0.510 0.606 0.316
SiamRPN (Li et al., 2018c) 0.463 0.549 0.253
SiamFCv2 (Valmadre et al., 2017) 0.374 0.404 0.144
SiamFC (Bertinetto et al., 2016c) 0.348 0.353 0.098
GOTURN (Held et al., 2016) 0.347 0.375 0.124 by minimizing the KL-divergence) in Target Center Regression (TCR)
CCOT (Danelljan et al., 2016) 0.325 0.328 0.107 and Bounding Box Regression (BBR) is the most crucial component
ECO (Danelljan et al., 2017a) 0.316 0.309 0.111
MDNet (Nam and Han, 2016) 0.299 0.303 0.099 that influences the capacity of the tracker. The DCF based tracker
CFNet_conv2 (Valmadre et al., 2017) 0.293 0.265 0.087 UPDT (Bhat et al., 2018) improves the baseline ECO by combining
ECOhc (Danelljan et al., 2017a) 0.286 0.276 0.096 output scores of deep (e.g., ResNet50) and shallow (e.g., HOG and
CFNet_conv5 (Valmadre et al., 2017) 0.270 0.225 0.072
CFNet_conv1 (Valmadre et al., 2017) 0.261 0.243 0.084 ColorNames) features with an adaptive fusion strategy. The KYS (Bhat
BACF (Galoogahi et al., 2017b) 0.260 0.262 0.101 et al., 2020) tracker shares the backbone network (i.e., ResNet50) and
DSST (Danelljan et al., 2014) 0.247 0.223 0.081 bounding box regressor (i.e., IoU-Net), we observe that there is no
Staple (Bertinetto et al., 2016b) 0.246 0.239 0.089
SRDCF (Danelljan et al., 2015b) 0.236 0.227 0.094
significant improvement in the test set of TrackingNet when compared
fDSST (Danelljan et al., 2017b) 0.206 0.187 0.075 to the baseline tracker DiMP50. The fine-tuned version of SiamFC𝑓 𝑡
KCF (Henriques et al., 2015) 0.203 0.177 0.065 improves its baseline 𝑆𝑖𝑎𝑚𝐹 𝐶𝑜𝑟𝑖 by 1% in terms of both success and
precision, which illustrates that fine-tuning some deep trackers on the
training set of TrackingNet can improve the generality of the test set.
Moreover, we find that gap between the PrDiMP18 (Danelljan In the above subsections, we have presented the tracking results
et al., 2020) and PrDiMP50 (Danelljan et al., 2020) is relatively small on five kinds of popular datasets. The discriminative deep trackers
(e.g., 0.8% in AUC and 1.3% in Precision) when applied ResNet18 have shown superior performance compared to other DCF or generative
and ResNet50 backbone networks respectively. In addition, PrDiMP18 trackers. Because of the deep backbone network, discriminative classi-
outperforms DiMP18 by 2.7% in Success and 2.5% in Precision re- fiers, object detectors, and large-scale training datasets, we can develop
spectively, demonstrating that the probabilistic regression (trained a more powerful model during offline training and online tracking.


Fig. 30. Qualitative examples: tracking results of the challenging sequences in OTB2015 on 8 video sequences, including Bird1, Jogging-2, Car4, MotorRolling, Skating2-1, Skating1, Skiing, and Soccer. The ground truth bounding boxes are in white.

9. Discussions

In this section, we first summarize the state-of-the-art trackers, including their tracking methods, backbone networks, representation schemes, training datasets, classification/regression methods, training schemes, update schemes (no-update, linear-update, non-linear update), tracking speed, published year, implementation platform (GPU/CPU), and re-detection (yes/no, i.e., Y/N), and report what kinds of tracking frameworks they utilize, including discriminative correlation filter (DCF), particle filter (PF), deep learning (DL), and tracking-by-detection (T&D), as shown in Table 7. The 'Representation' column summarizes the features that trackers used, including HOG, Color Names, PCA-HOG, CIE LAB color feature, HOI (histogram of local intensities), BMR (Boolean map representation), LBP (local binary pattern), intensity, edge, raw pixels, grayscale image, and different features from CNNs. The 'Backbone' column presents the neural networks used for feature representation and target modeling, such as AlexNet, CaffeNet, VGG-M, VGG-16, VGG-19, ResNet-18/50, and GoogLeNet. The 'Training Data' column summarizes the datasets used for training the deep trackers, including ILSVRC2015 VID, ILSVRC2015 DET, Youtube-BB, GOT-10k, LaSOT, COCO, TrackingNet, TC-128, Cityperson, WilderFace, VOC2012, OTB-100, VOT2013, VOT2014, VOT2015, and ALOV300++. The 'Training Scheme' column summarizes the objective functions and training methods of different trackers for further understanding of the tracking algorithms. For example, binary cross-entropy is used for training the classification branch, and IoU loss or L1 loss are used for training the bounding box regression branch. The 'Localization Method' column summarizes how the trackers perform target localization and scale estimation during tracking. The 'Update' column summarizes how the model is updated during tracking. The strategies for the model update can be classified into four categories: (1) no update: the model is not updated during tracking; (2) linear update: the model is linearly updated with the EMA (exponential moving average) strategy, such as MOSSE (Bolme et al., 2010); (3) meta-update: the model is updated in the meta-learning framework during tracking; (4) nonlinear-update: the model is updated online as a nonlinear process.

We then discuss the ten most challenging scenarios, including occlusion, illumination, motion blur, deformation, scale variation, out-of-view, background clutter, fast motion, rotation, and low resolution, for the trackers mentioned in Table 7, according to their tracking performance. Finally, we discuss the model update, motion model, hyper-parameter tuning, and tracking speed.

9.1. Occlusion

In visual object tracking, developing a robust and accurate tracker is still challenging due to a vast number of occlusions caused by distractors or non-semantic backgrounds. Although the DCF based frameworks have achieved great success due to their high efficiency of computation, especially when combined with deep learning methods, their capacity to deal with object occlusion still needs to be improved. In early works, sparse representations (Mei et al., 2011; Zhang et al., 2012b; Ji et al., 2012; Zhang et al., 2015; Sui et al., 2015) and part-based appearance models (Liu et al., 2015; Wang et al., 2018a) were proposed to deal with the occlusion problem. The sparse trackers model the appearance of a target with a linear combination of a template set, which can be dynamically updated according to the appearance changes (e.g., partial or heavy occlusion). However, most sparse trackers operate within the particle filter framework, where a sufficient number of particles needs to be sampled to guarantee good performance of the appearance model. Although we can improve the robustness of a tracker with more particles, the high computation cost will affect the tracking speed. Furthermore, the part-based trackers model the target by multiple parts to improve the robustness against partial occlusion. Recently, CPF (Zhang et al., 2018c) and MCPF (Zhang et al., 2017c) utilized multiple correlation filters (MCF) to adapt the appearance model to appearance variations. By applying MCF to the particle sampling process, the number of particles can be decreased significantly, and each of the particles can be shepherded towards a more accurate target location. However, as shown in Figs. 22 and 23, ECO (Danelljan et al., 2017a), CCOT (Danelljan et al., 2016), and CFCF (Gundogdu and Alatan, 2018) perform better than MCPF (Zhang et al., 2017c) for all attributes except low resolution and in-plane rotation at the precision score.

A re-detection scheme was utilized in TRACA (Choi et al., 2018) to deal with the full occlusion scenario, which is determined by the difference between the maximal response values of adjacent frames. Moreover, other similarity comparison algorithms can also detect the full occlusion or the loss of the target. One phenomenon in the occlusion scenario is that state-of-the-art deep trackers like SiamBAN (Chen et al., 2020) and SiamRPN++ (Li et al., 2019b) lose the target due to severe occlusion in the 435th frame of the Skating2-1 sequence in Fig. 30, while DCF trackers like ECO (Danelljan et al., 2017a) and CFCF (Gundogdu and Alatan, 2018) still track the target successfully, which illustrates the weakness of some deep Siamese trackers (e.g., SiamRPN++) in dealing with occlusion.
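A simple version of the response-based full-occlusion test mentioned above, which flags that the target is likely occluded or lost when the peak correlation response drops sharply relative to recent frames, can be sketched as follows. The drop ratio and smoothing factor are illustrative choices, not values taken from TRACA or any other cited tracker.

```python
import numpy as np

class OcclusionDetector:
    """Flags likely full occlusion when the peak response drops far below its running average."""

    def __init__(self, drop_ratio=0.4, momentum=0.9):
        self.drop_ratio = drop_ratio
        self.momentum = momentum
        self.avg_peak = None

    def update(self, response_map):
        peak = float(np.max(response_map))
        if self.avg_peak is None:
            self.avg_peak = peak
            return False
        occluded = peak < self.drop_ratio * self.avg_peak
        if not occluded:
            # Only update the running statistic with reliable frames.
            self.avg_peak = self.momentum * self.avg_peak + (1 - self.momentum) * peak
        return occluded

# Usage: call detector.update(score_map) once per frame; a True return can trigger re-detection.
```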
Table 7
Summary of state-of-the-art trackers with different attributes. C-Classification, R-Regression, SGD-Stochastic Gradient Descent, ADMM-Alternating Direction Method of Multipliers, BPTT-Backpropagation Through Time, APG-Accelerated
Proximal Gradient, MSE-Mean Square Error, CN-Color Names, BBR-Bounding Box Regression, CG-Conjugate Gradient.
Trackers Frameworks Backbone Representation Training data Localization method Training scheme Update FPS GPU Re- Year
detection
SiamRN (Cheng et al., DL ResNet-50 Block-3,4,5 ILSVRC2015 VID, binary classification, binary cross-entropy no-update – Y N 2021
2021) Youtube-BB, GOT-10k, BBR loss+IoU loss+MSE
LaSOT, ILSVRC2015 DET loss+SGD
SiamGAT (Guo et al., DL GoogLeNet Inceptionv3 ILSVRC2015 VID, binary classification, binary cross-entropy no-update 70 Y N 2021
2021) Youtube-BB, GOT-10k, BBR loss+IoU loss+SGD
COCO, ILSVRC2015 DET
STMTrack (Fu et al., 2021) DL GoogLeNet Inceptionv3 ILSVRC2015 VID, binary classification, focal loss+binary linear-update 37 Y N 2021
ILSVRC2015 DET, GOT-10k, BBR cross-entropy loss+IoU
COCO, LaSOT, TrackingNet loss+SGD
LightTrack (Yan et al., DL supernet – ILSVRC2015 VID, binary classification, binary cross-entropy no-update – Y N 2021
2021) ILSVRC2015 DET, COCO, BBR loss+IoU loss+SGD
GOT-10k, Youtube-BB
TransT (Chen et al., 2021) DL ResNet-50 Block-4 COCO, TrackingNet, LaSOT, binary classification, giou loss+𝐿1 no-update 50 Y N 2021
GOT-10k BBR loss+cross-entropy
loss+AdamW
SiamCAR (Guo et al., DL ResNet-50 Block-3,4,5 COCO, ILSVRC2015 VID, binary classification, binary cross-entropy no-update 52 Y N 2020
2020) ILSVRC2015 DET, BBR loss+IoU
Youtube-BB, GOT-10k, loss+centerness
LaSOT loss+SGD

TrDiMP (Wang et al., DL ResNet-50 – COCO, TrackingNet, LaSOT, binary classification, 𝐿2 norm loss+AdamW nonlinear-update 26 Y N 2021
2021b) GOT-10k IoU-prediction
KYS (Bhat et al., 2020) DL ResNet-50 Block-4 TrackingNet, LaSOT, binary classification, 𝐿2 norm nonlinear-update 20 Y N 2020
GOT10k IoU-prediction loss+MSE+binary
cross-entropy+Adam
Ocean (Zhang et al., 2020) DL ResNet-50 Block-4 Youtube-BB, ILSVRC2015 binary classification, IoU loss+binary no-update 58 Y N 2020
VID, ILSVRC2015 DET, BBR cross-entropy+SGD
GOT-10k, COCO
Retina/FCOS-MAML (Wang DL ResNet-18 Block-4 COCO, GOT-10k, binary classification, 𝐿2 norm loss(C)+𝐿1 meta-update 40 y N 2020
et al., 2020) TrackingNet, LaSOT BBR norm loss loss(R)+SGD



SiamAttn (Yu et al., 2020) DL ResNet-50 Block-3,4,5 COCO, TrackingNet, binary classification, smooth 𝐿1 loss+binary nonlinear-update 45 Y N 2020
Youtube-VOS, LaSOT BBR cross-entropy loss+SGD
SiamBAN (Chen et al., DL ResNet-50 Block-3,4,5 COCO, ILSVRC2015 VID, binary classification, IoU loss+binary no-update 40 Y N 2020
2020) ILSVRC2015 DET, BBR cross-entropy+SGD
Youtube-BB, GOT-10k,
LaSOT
SiamR-CNN (Voigtlaender DL ResNet-101- Block-2,3,4,5 COCO, Youtube-VOS, binary classification, binary cross-entropy no-update 4.7 Y Y 2020
et al., 2020) FPN ILSVRC2015 VID, GOT-10k, BBR loss+huber
LaSOT loss+Momentum
PrDiMP50 (Danelljan DL ResNet-18/50 Block-4 LaSOT, GOT-10K, probabilistic density KL-divergence nonlinear-update 30 Y N 2020
et al., 2020) TrackingNet, COCO regression, loss+steepest descent
IoU-prediction method
D3S (Lukežič et al., 2020) DL ResNet-50 Block-1,2,3 Youtube-VOS binary prediction, binary cross-entropy no-update 48 Y N 2020
segmentation loss+Adam
CGACD (Du et al., 2020) DL ResNet-50 Block-4 YouTube-BB, GOT-10k, binary classification, logistic loss + 𝐿1 no-update 70 Y N 2020
ILSVRC2015 VID, COCO, BBR loss+elastic net
ILSVRC2015 DET loss+SGD
DCFST (Zheng et al., 2020) DL ResNet-18/50 Block-3,4 TrackingNet, GOT-10k, binary classification, shrinkage loss+Adam linear-update 35 Y N 2020
LaSOT BBR
SPM-Tracker (Wang et al., DL AlexNet conv2, conv4 ILSVRC2015 VID, binary classification, binary cross-entropy no-update 120 Y N 2019
2019a) Youtube-BB, COCO, BBR loss+smooth 𝐿1
ILSVRC2015 DET, person, loss+SGD
WiderFace
UDT (Wang et al., 2019b) DL – – ILSVRC2015 VID binary classification, 𝐿2 loss+SGD linear-update 70 Y 𝑁 2019
scale searching
DiMP50 (Bhat et al., 2019) DL ResNet-18/50 Block-3,4 TrackingNet, LaSOT, binary classification, 𝐿2 norm loss+Adam nonlinear-update 40 Y N 2019
GOT-10k, COCO IoU-prediction
MLT (Choi et al., 2019) DL AlexNet Conv5 ILSVRC2015DET, binary classification, logistic loss+Adam meta-update 48 Y N 2019
ILSVRC2017VID scale searching
GFS-DCF (Xu et al., 2019) DCF+DL ResNet-50 HOG, CN, – binary classification, 𝐿2 norm loss+ADMM linear update 8 Y N 2019
Intensity scale searching
Channels, res4x
GradNet (Li et al., 2019a) DL AlexNet conv2,3,4,5 ILSVRC2014 VID binary classification, logistic loss+SGD nonlinear-update 80 Y N 2019
scale searching

SiamRPN++ (Li et al., DL ResNet-50 Block-3,4,5 COCO, ILSVRC2015 DET, binary classification, binary cross-entropy no-update 35 Y N 2019
2019b) ILSVRC2015 VID, BBR loss+𝐿1 loss+SGD
Youtube-BB
ASRCF (Dai et al., 2019) DCF VGG-M, Norm1(VGG-M), – binary classification, 𝐿2 norm loss+ADMM linear-update 28 Y N 2019
VGG-16 Conv4-3(VGG- scale searching
16),
HOG
SiamDW (Zhang and Peng, DL CIResNet-22 Block-3 ILSVRC2015 VID, binary classification, logistic/binary no-update 150 Y N 2019
2019) Youtube-BB BBR cross-entropy
loss+smooth 𝐿1



loss+SGD
UpdateNet (Zhang et al., DL AlexNet conv5 LaSOT same as least square loss+SGD nonlinear-update – Y N 2019
2019a) SiamFC/DaSiamRPN
ATOM (Danelljan et al., DL ResNet-18 Block-3,4 LaSOT, TrackingNet, COCO binary classification, MSE+Adam(R), 𝐿2 nonlinear-update 30 Y N 2019
2019) BBR norm loss+Conjugate
Gradient(C)
BGDT (Huang et al., 2019) DL ResNet-50, L1, L2, GOT-10k binary classification, cross-entropy nonlinear-update 3 Y N 2019
VGG-16 L3(VGG-16) BBR loss+smooth 𝐿1
loss+SGD
SFT (Cui et al., 2019) DL VGG-19 conv4-2∼4-4, – binary classification, 𝐿2 norm loss+gradient linear-update 5 Y N 2019
conv5-2∼5–4 scale searching descent
DRL-IS (Ren et al., 2018) DL VGG-M fc4 VOT2013, VOT2014, action decision Q-function+Advantage nonlinear-update 10 Y Y 2018
VOT2015, ILSVRC2015 VID function+Adam
PAC (Zhang et al., 2018a) DCF – HOG, BMR – binary classification, 𝐿2 norm loss linear-update 45 N N 2019
scale searching
MemTrack (Yang and DL AlexNet conv5 ILSVRC2015 VID binary classification, logistic loss+Adam nonlinear-update 50 Y N 2018
Chan, 2018) scale searching
FlowTrack (Zhu et al., DCF+DL FeatureNET, CNN, flow ILSVRC2015 VID binary classification, 𝐿2 norm loss+SGD linear-update 12 Y N 2018
2018b) FlowNet scale searching
CPF (Zhang et al., 2018c) DCF – HOG – binary classification, 𝐿2 norm loss linear-update 8.3 N N 2018
particle filtering
VITAL (Song et al., 2018) T&D+DL VGG-M conv-1,2,3 OTB100, VOT13, VOT14, binary classification, binary cross-entropy nonlinear-update 1.5 Y N 2018
VOT15 candidates sampling +SGD
ACT (Chen et al., 2018) DL VGG-M fc6 ILSVRC2015 VID binary 𝐿2 loss+Adam, nonlinear-update 30 Y Y 2018
classification+action negative log
decision loss+SGD
BMR (Zhang et al., 2018d) PF – HOG, intensity, – binary classification, logistic loss+gradient nonlinear-update 5.26 N N 2018
CIE LAB color particle filtering descent
feature

SACF (Zhang et al., 2018f) DL CaffeNet fc3, conv2 ILSVRC2015 VID binary classification, 𝐿2 norm+𝐿1 norm linear-update 23 Y N 2018
scale searching+spatial loss+
transformation
MAM (Chen et al., 2019) DL VGG-16 conv4-3, TC-128, OTB-100 binary classification, binary cross-entropy nonlinear-update 3 Y Y 2018
conv5–3 candidates sampling loss+ SGD
STNCF (Zhang et al., DCF – HOG – binary classification, 𝐿2 norm loss+ADMM linear-update 5 N N 2018
2018b) scale searching
SiamRPN (Li et al., 2018c) DL modified conv5 ILSVRC2015 VID, binary classification, smooth 𝐿1 loss+ no-update 200 Y N 2018
AlexNet Youtube-BB BBR binary cross-entropy
loss+SGD



DaSiamRPN (Zhu et al., DL modified conv5 ILSVRC2015 VID, binary classification, smooth 𝐿1 loss+ linear-update 160 Y Y 2018
2018a) AlexNet Youtube-BB BBR binary cross-entropy
loss+SGD
DRT (Sun et al., 2018a) DCF+DL VGG-M, conv1(VGG-M), – binary classification, 𝐿2 norm loss+ nonlinear-update – Y N 2018
VGG-16 conv4-3(VGG- scale searching conjugate gradient
16), HOG, descent
CN
ACFT (Ma et al., 2018) DCF VGG-19 conv5-4, HOG, – binary classification, – linear-update 14.4 N Y 2018
HOI scale searching
DSLT (Lu et al., 2018) DL VGG-16 conv4-3, – binary classification, shrinkage loss+Adam nonlinear-update 5.7 Y N 2018
conv5–3 scale searching
StructSiam (Zhang et al., DL AlexNet – ILSVRC2014 VID binary classification, logistic loss+SGD no-update 45 Y 𝑁 2018
2018e) scale searching
RTINet (Yao et al., 2018) DL VGG-M conv3 ILSVRC2015 VID binary classification, 𝐿2 norm loss+SGD linear-update 24 Y N 2018
scale searching
RT-MDNet (Jung et al., DL VGG-M fc6 ILSVRC2015 VID binary classification, binary classification nonlinear-update 46/52 Y Y 2018
2018) BBR loss+instance
embedding loss+SGD
DeepSTRCF (Li et al., DCF+DL VGG-M conv3, HOG, CN – binary classification, 𝐿2 norm loss+ADMM linear-update 5.3 Y N 2018
2018a) scale searching
LSART (Sun et al., 2018b) DL VGG-16 conv4-3, HOG, – kernel ridge regression, 𝐿2 norm loss+SGD nonlinear-update 1 Y N 2018
CN scale searching
RASNet (Wang et al., DL – – ILSVRC2015 VID binary classification, logistic loss+SGD no-update 83 Y N 2018
2018) scale searching
NR-MVDLSR (Kang et al., PF – intensity, LBP, – particle filtering nonlocal regularized non-linear 1.1 N N 2018
2019) edge least square
loss+adaptive ADMM

SA-SIAM (He et al., 2018) DL AlexNet conv4, conv5 ILSVRC2015 VID binary classification, logistic loss+SGD no-update 50 Y N 2018
scale searching
CFCF (Gundogdu and DCF+DL VGG-M raw pixels, VOT2015, ILSVRC2015 VID binary classification, 𝐿2 norm loss+SGD linear-update 1.7 Y N 2018
Alatan, 2018) conv1-3, scale searching
conv5–2
TRACA (Choi et al., 2018) DCF+DL VGG-M conv3, conv4, VOC2012 binary classification, cross-entropy linear-update 101.3 Y Y 2018
conv5 scale searching loss+correlation filter
orthogonality loss+SGD
OAPT (Wang et al., 2018a) DCF+DL VGG-19 HOG, conv3-4, – binary classification, 𝐿2 norm loss linear-update 6 Y N 2018
conv4-4, scale searching



conv5–4
Meta-Tracker (Park and DL AlexNet conv1, conv2, ImageNet VID binary classification, logistic loss+anchor meta-update 82 Y N 2019
Berg, 2018) conv5 scale searching loss+Adam
p-track (Supančič III and DL VGG-16 – Internet videos binary classification, Q-function+SGD nonlinear-update 10fps Y Y 2017
Ramanan, 2017) candidates sampling
SiamTri (Dong and Shen, DL – – ILSVRC2015 VID binary classification, triplet loss+SGD same as baseline 55∼86 Y N 2018
2018) scale searching
DSST (Danelljan et al., DCF – PCA-HOG – binary classification, 𝐿2 norm loss linear-update 54.3 N N 2017
2017b) scale searching
CREST (Song et al., 2017) DL VGG16 conv4–3 – binary classification, 𝐿2 norm loss+Adam nonlinear-update 1 Y N 2017
scale searching
EAST (Huang et al., 2017a) DL AlexNet HOG, ILSVRC2015 VID binary classification, Q-function+𝜖- no-update 159 Y N 2017
conv-1,2,3,4,5 decision-making greedy+SGD
Obli-RaF (Zhang et al., T&D VGG-16 conv4-3, – binary classification, PSVM+recursive least linear-update 2 Y N 2017
2017b) conv5–3 candidates sampling square loss
BranchOut (Han et al., DL VGG-M conv1,2,3 VOT2013, VOT2014, BBR binary cross-entropy nonlinear-update 1 Y N 2017
2017) VOT2015, OTB-100 loss+SGD
PTAV (Fan and Ling, DCF+DL VGG-16 conv4-3, – cross-correlation, scale 𝐿2 norm loss linear-update 27 Y Y 2017
2017a) conv5–3 searching
ECO (Danelljan et al., DCF VGG-M conv1-3, – binary classification, 𝐿2 norm nonlinear-update 8 Y N 2017
2017a) conv5-2, HOG, scale searching loss+Gauss–Newton
CN method+CG
ACFN (Choi et al., 2017) DCF+DL – HOG, CN VOT14, VOT15 binary classification, MSE+Adam, 𝐿0 norm linear-update 15 Y Y 2017
scale searching loss+gradient descent
SANet (Fan and Ling, DL CNN+RNN fc6 VOT2013, VOT2014, binary classification, binary nonlinear-update 1 Y N 2017
2017b) VOT2015, OTB-100 linear BBR cross-entropy+SGD
TSN (Teng et al., 2017) DL VGG-19 conv1, conv2, – binary classification, 𝐿1 loss+𝐿2 norm nonlinear-update 1 Y N 2017
conv3 scale searching loss+SGD
CFNet (Valmadre et al., DCF+DL AlexNet conv1, conv2, ILSVRC2015 VID binary classification, logistic loss+SGD linear-update 83 Y N 2017
2017) conv5 scale searching
ADNet (Yun et al., 2017) DL VGG-M conv1, conv2, VOT2013, VOT2014, binary classification, binary cross-entropy nonlinear-update 3 Y Y 2017
conv3 VOT2015, ALOV300++ action selecting loss+SGD

MCPF (Zhang et al., CPF VGG-19 conv3-4, – binary classification, least square loss+APG linear-update 1.96 Y N 2017
2017c) conv4-4, particle sampling
conv5–4
CSR-DCF (Lukežič et al., DCF – HOG, CN, HSV – binary classification, 𝐿2 norm loss+ADMM linear-update 13 N N 2017
2017) scale searching
BACF (Galoogahi et al., DCF – HOG – binary classification, ridge linear-update 35 N N 2017
2017b) scale searching regression+ADMM
DSiam (Guo et al., 2017) DL AlexNet conv4, conv5 ILSVRC2015 VID binary classification, logistic loss+BPTT+ linear-update 25 Y N 2017
scale searching SGD
SiamFC (Bertinetto et al., DL AlexNet conv5 ILSVRC2015 VID binary classification, logistic loss+SGD no-update 86 Y N 2016



2016c) scales searching
Staple (Bertinetto et al., DCF – HOG, Color – binary classification, 𝐿2 norm loss linear-update 80 Y N 2016
2016b) histogram scale searching
MDNet (Nam and Han, T&D+DL VGG-M fc6 VOT2013, VOT2014, binary classification, binary nonlinear-update 1 N Y 2016
2016) VOT2015, OTB-100 BBR cross-entropy+SGD
SCT (Choi et al., 2016) DCF – HOG, average of – binary classification – linear-update 40 Y N 2016
RGB, lab color
STCT (Wang et al., 2016) DL VGG-16 conv4–3 – binary classification, hinge loss+least square nonlinear-update 2.5 Y N 2016
scale prediction loss+SGD
SINT (Tao et al., 2016) DL AlexNet or conv4-3, ALOV300++ template matching, margin contrastive no-update 4 Y N 2016
VGG-16 conv5-3, fc6 linear BBR loss+SGD
HDT (Qi et al., 2016) DCF+DL VGG-19 conv4-2∼4-4, – binary classification least square loss linear-update 10 Y N 2016
conv5-2∼5–4
CCOT (Danelljan et al., DCF+DL VGG-M-2048 raw pixels, – binary classification, least square nonlinear-update 0.3 Y N 2016
2016) conv1-3, scale searching loss+Conjugate
conv5–2 Gradient
GOTURN (Held et al., DL CaffeNet fc8 ALOV300++ BBR 𝐿1 loss+SGD no-update 165 Y N 2016
2016)
MMST (Chen et al., 2017b) PF – raw pixels – particle filtering 𝐿1 norm loss+APG linear-update 14.3 N N 2016
SST (Zhang et al., 2015) PF – grayscale image – particle filtering least square loss+APG linear-update – N N 2015
OTPS (Hua et al., 2015) T&D – HOG – binary classification – nonlinear-update 0.3 Y N 2015
MKCF (Tang and Feng, DCF – HOG, CN – binary classification, 𝐿2 norm linear-update 15 N N 2015
2015) scale searching loss+optimization
method
FCNT (Wang et al., 2015) DL VGG-16 conv4-3, – binary classification, least square loss+SGD nonlinear-update 1 Y N 2015
conv5–3 candidates sampling
DLRT (Sui et al., 2015) PF – grayscale image – particle filtering 𝐿0 minimization nonlinear-update 3 N N 2015
KCF (Henriques et al., DCF – HOG, raw pixels – binary classification, no kernel ridge regression linear-update 292 Y N 2015
2015) scale searching
SRDCF (Danelljan et al., DCF – HOG, grayscale – binary classification, least square loss linear-update 5 N N 2015

2015b) image, CN scale searching +Gauss–Seidel


optimization
HCFT (Ma et al., 2015a) DCF+DL VGG-19 conv3-4, – binary classification, no 𝐿2 norm loss linear-update 11 Y N 2015
conv4-4, scale searching
conv5–4
MUSTer (Hong et al., DCF – HOG, CN, SIFT – binary 𝐿2 norm loss linear-update 4 N Y 2015
2015) classification+keypoints
matching, scale
searching
DeepSRDCF (Danelljan DCF+DL VGG-M conv1–3 – binary classification, 𝐿2 norm loss linear-update <1 Y N 2015



et al., 2015a) scale searching
LCT (Ma et al., 2015c) DCF – HOG, intensity – binary classification, 𝐿2 norm loss+ linear-update 27 N Y 2015
histogram scale searching
SAMF (Li and Zhu, 2014) T&D – raw pixels, HOG, – binary classification, kernel ridge regression linear-update 7 N N 2014
Color Names scale searching
MEEM (Zhang et al., T&D – Lab color space – binary classification hinge loss nonlinear-update 10 N N 2014
2014c)
TLD (Kalal et al., 2011) T&D – raw pixels – Nearest neighbor – nonlinear-update 22 N Y 2012
classification
Struck (Hare et al., 2016) T&D – Haar, HOG, raw – – – – – N N 2011
pixels

A potential reason for such failure in these deep Siamese trackers is that the semantic deep features could easily lead to model drift when the target undergoes severe occlusion and the matching relies on local convolutional features. However, some particle filter based trackers and part-based trackers could avoid the model drift problem to a certain extent due to the particle sampling strategy or detection with local features. From the success plot for the attribute of Partial Occlusion, the TrDiMP tracker outperforms its counterpart DiMP50 by 8% while not being equipped with the re-detection scheme used in SiamR-CNN; the main reasons for this improvement are the robust embedded template feature from a set of history frames and the global cross-attention in the Transformer decoder.

9.2. Illumination

Illumination variations tend to happen with the changing of light on the background and the target objects, as in the examples (i.e., Car4 and Skating1) presented in Fig. 30. FlowTrack (Zhu et al., 2018b) addressed the illumination problem by a warping operation, which provides multi-dimensional information from previous frames to improve the discriminative ability. In the case of illumination changes, the motion information can be used to warp multiple features from the previous t-1 frames to the t-th frame, which provides diverse information for the tracker. Different from DeepSRDCF (Danelljan et al., 2015a), which employed a single-resolution deep feature, the tracking results of CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a) demonstrated that multi-resolution deep features can significantly improve the performance of DCF trackers.

Another solution to handle appearance variations such as illumination is to construct a mixture model to improve the robustness. Zhang et al. (2018c) proposed a correlation particle filter (CPF) model, which extends the KCF and exploits multiple correlation filters within the particle framework to handle the illumination problem. Furthermore, the multi-task correlation filter learned with different parts and features (Zhang et al., 2019b) of the target object can also be incorporated into the tracking framework to alleviate the illumination problem. This is because multiple correlation filters with respect to different kinds of features (e.g., intensity, Haar-like features, HOG, and deep features) can cover a wide range of appearance variations such as illumination and occlusion.

In addition to the spatial regularization in SRDCF (Danelljan et al., 2015b), DeepSTRCF (Li et al., 2018a) added a temporal regularization on the correlation filters, in which the filters are passively updated in the frames with small appearance variations. Similarly, STNCF (Zhang et al., 2018b) used a spatio-temporal nonlocal regularization that explores long-term spatio-temporal nonlocal information on DCF to improve the robustness to illumination variations. As shown in Fig. 22, for the illumination variation attribute, DeepSTRCF (Li et al., 2018a) achieves an AUC score of 66.3%, which outperforms DeepSRDCF (Danelljan et al., 2015a) by 4.2%. DeepSTRCF also outperforms deep trackers such as DaSiamRPN (Zhu et al., 2018a) and SA-SIAM (He et al., 2018) by 0.8% and 1.9%, respectively.

9.3. Motion blur

Motion blur often occurs when the target or camera moves fast, as in the sequences (i.e., MotorRolling and Soccer) shown in Fig. 30. In this scenario, it is difficult for a tracker to distinguish the contour or detect the local features (e.g., texture, color, and edge) of the target. The reasons for this include the low camera frame rate, which makes it challenging to capture fast-moving targets, and camera shaking, which may make the image blurry. To validate the effectiveness of tracking performance at different frame rates, Galoogahi et al. (2017a) introduced a tracking benchmark with a higher frame rate (i.e., 240 fps). The tracking performances at a high frame rate show that some trackers with hand-crafted features can achieve considerable improvement compared with those at a low frame rate, and even outperform some deep trackers.

As shown in Figs. 22 and 23, the conventional DCF trackers such as Staple (Bertinetto et al., 2016b), SAMF (Li and Zhu, 2014), DSST (Danelljan et al., 2014), and KCF (Henriques et al., 2015) cannot perform well with hand-crafted features in the case of motion blur. The periodic assumption utilized in these approaches during training and tracking may cause boundary effects, especially when the target has undergone severe motion blur. We see that the discriminative capacity is improved by applying a variety of regularizations, such as the spatial regularization in SRDCF (Danelljan et al., 2015b), the contextual regularization in BACF (Galoogahi et al., 2017b), and the temporal regularization in DeepSTRCF (Li et al., 2018a). The deep features help to significantly enhance the robustness of tracking, as shown by the deep trackers with different layers of features (e.g., CFNet-conv1, CFNet-conv2, and CFNet-conv5).

Enlarging the size of training samples in CFLB (Kiani Galoogahi et al., 2015) and the motion blur introduced in the data augmentation (Zhu et al., 2018a) have further improved the discriminative ability and robustness of the tracker to motion blur. Moreover, CFCF (Gundogdu and Alatan, 2018) achieved a competitive performance on the motion blur attribute in Figs. 22 and 23. We attribute the main reason for this improvement to the deep features from a VGG-m (Chatfield et al., 2014) model fine-tuned with 200k triplet samples collected from the ILSVRC (Russakovsky et al., 2015) dataset, while ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016) directly utilized the pretrained VGG-m model (Chatfield et al., 2014). In addition, deep trackers such as STMTrack (Fu et al., 2021) and TrDiMP (Wang et al., 2021b) perform much better than other trackers in terms of accuracy and robustness for the attribute of motion blur. Both trackers benefit from a more stable and accurate target model learned by the Transformer module from the memory of historical frames, instead of a model learned from a single frame as in SiamRPN++ (Li et al., 2019b) and SiamBAN (Chen et al., 2020).

9.4. Deformation

Different from occlusion, illumination, and motion blur, target deformation occurs due to appearance changes, as in the Bird1 and Skating2-1 sequences shown in Fig. 30 and the fernando and gymnastics1 sequences shown in Fig. 31. Both low-level and high-level semantic features are important to adapt to appearance changes such as deformation. In early works, Locally Orderless Matching (i.e., LOT, Avidan et al., 2015), sparse representation (e.g., SCM, Zhong et al., 2012 and ASLA, Jia et al., 2012), and long-term memory (e.g., MUSTer, Hong et al., 2015) were proposed to enhance the robustness of the trackers. Recently, Zhu et al. (2018b) utilized a Siamese network that consists of historical and current branches, which is similar to maintaining a long-term memory model of target appearance for handling severe deformation. ACFT (Ma et al., 2018) simultaneously maintained long-term and short-term memory to improve the robustness of the tracker.

In addition, the deformable attention in SiamAttn (Yu et al., 2020), the depth-wise cross-correlation and ResNet-driven Siamese network in SiamRPN++ (Li et al., 2019b), and the geometrically invariant model (GIM) in D3S (Lukežič et al., 2020) enable them to achieve superior performance compared with the conventional DCF trackers and shallow Siamese trackers, which is illustrated by the Rv and EAO scores in Table 4. In conclusion, existing deep trackers still need to improve their ability to handle deformation.

Deeper and wider networks, large-scale datasets, and a variety of sample augmentation techniques can improve the performance of deep trackers when dealing with deformation.
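The motion-blur data augmentation referred to above can be approximated by convolving a training patch with a linear kernel. A minimal sketch using OpenCV is shown below; the kernel length and the purely horizontal streak are simplified assumptions and do not reproduce the augmentation pipeline of DaSiamRPN.

```python
import numpy as np
import cv2

def motion_blur(image, kernel_size=9):
    """Apply a simple horizontal motion blur to a training patch (H, W, 3) uint8 image."""
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size  # horizontal streak
    return cv2.filter2D(image, -1, kernel)

# Toy usage on a random 127x127 exemplar patch.
patch = (np.random.rand(127, 127, 3) * 255).astype(np.uint8)
blurred = motion_blur(patch, kernel_size=7)
```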

Fig. 31. Qualitative examples: tracking results of the challenging sequences in VOT2018 on 8 video sequences, including bmx, butterfly, crabs1, fernando, fish1, frisbee, hand, and gymnastics1. The ground truth bounding boxes are in white.

9.5. Scale variation 9.7. Background clutter

The scale variation means the aspect ratio of the target object Background clutter can lead to the model drift problem as the
changes dramatically due to appearance changes, viewpoint changes, or 𝑆𝑜𝑐𝑐𝑒𝑟 sequence shown in Fig. 30. The performance of trackers in
motion changes. Most CFs based trackers estimate the target scale by a cluttered scenes has a big gap compared with the performance of
one-dimensional multi-scale searching method. However, the limitation trackers under other challenges as shown in Fig. 26. The tracker
of this strategy is that it only generates samples with their width and TrDiMP (Wang et al., 2021b) obtains a much better success core on the
height scaled with the same scale factor 𝑎𝑟 , where 𝑎 and 𝑟 are the scale LaSOT dataset for the Background Clutter attribute. The target model
increment factor and scale level, respectively. GOTURN (Held et al., produced by the Transformer module with multiple historical feature
2016) directly predicts the bounding box with a regressor which was maps exhibits more discriminative ability compared to DiMP50 (Bhat
implemented as multiple fully-connected layers. However, this tracker et al., 2019) and SiamRPN++ (Li et al., 2019b). Among DCF trackers,
was learned under the assumption of motion smoothness. Later, in RPN instead of a single sample cropped according to the previous location,
based deep trackers, the dense sampling strategy with multiple scales multiple samples that are consecutively collected during tracking also
and aspect ratios is used to generate diverse candidate proposals. By se- make contributions to the stability of the learning target model such as
lecting the optimal bounding box with IoU-Net, the Precise RoI Pooling CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a).
used in ATOM (Danelljan et al., 2019) enables us to iteratively refine
the bounding box by gradient ascent method, please refer to Jiang et al.
9.8. Fast motion
(2018) for detail. More recently, in anchor-free detectors, the width and
height of the target object are predicted by a regression model to adapt
Fast motion means the motion of the target between two adjacent
to its shape variations. In addition, CGACD (Du et al., 2020) uses the
frames is larger than the size of the target. As shown in Fig. 26, the
corner-detection to adapt to the scale changes of the target. As shown
highest success score under the attribute of Fast Motion achieves 52.2%
in Fig. 26, SiamR-CNN (Voigtlaender et al., 2020) achieves the best
which is the worst of all the attributes. Compared to the OTB-100
performance under the scale variation, which employs Cascade Faster
dataset, there are more video sequences in the LaSOT dataset with
R-CNN (Cai and Vasconcelos, 2018) with a ResNet-101-FPN backbone.
smaller and fast-moving targets. Most DCF or deep Siamese trackers
Compared to Faster R-CNN (Ren et al., 2015), Cascade Faster R-CNN
employ a local search strategy, which cannot handle the fast motion
can sequentially improve the quality of the hypotheses and hence the
challenge. The DCF tracker ECO (Danelljan et al., 2017a) only achieves
accuracy of the tracking result.
9.6. Out-of-view

When the target partially or fully leaves the camera field of view, most trackers easily lose the target and have difficulty re-acquiring it when it reappears. To address this problem, several trackers design a re-detection strategy to capture the target during long-term tracking, such as SiamR-CNN (Voigtlaender et al., 2020), DRL-IS (Ren et al., 2018), ACT (Chen et al., 2018), ACFT (Ma et al., 2018), MDNet (Nam and Han, 2016), and MUSTer (Hong et al., 2015). For this attribute, TrDiMP (Wang et al., 2021b) and SiamR-CNN (Voigtlaender et al., 2020) obtain the best success scores of 68.3% and 62.2% on the OTB-100 and LaSOT datasets, respectively. Most short-term tracking methods, such as ECO (Danelljan et al., 2017a), SiamFC (Bertinetto et al., 2016c), and SiamRPN++ (Li et al., 2019b), do not study the re-detection scheme in depth.

9.7. Background clutter

Background clutter can lead to the model drift problem, as in the Soccer sequence shown in Fig. 30. The performance of trackers in cluttered scenes has a big gap compared with their performance under other challenges, as shown in Fig. 26. The tracker TrDiMP (Wang et al., 2021b) obtains a much better success score on the LaSOT dataset for the Background Clutter attribute; the target model produced by its Transformer module with multiple historical feature maps exhibits more discriminative ability compared to DiMP50 (Bhat et al., 2019) and SiamRPN++ (Li et al., 2019b). Among DCF trackers, instead of a single sample cropped according to the previous location, multiple samples collected consecutively during tracking also contribute to the stability of the learned target model, as in CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a).

9.8. Fast motion

Fast motion means that the motion of the target between two adjacent frames is larger than the size of the target. As shown in Fig. 26, the highest success score under the Fast Motion attribute is only 52.2%, which is the worst among all the attributes. Compared to the OTB-100 dataset, the LaSOT dataset contains more video sequences with smaller and fast-moving targets. Most DCF or deep Siamese trackers employ a local search strategy, which cannot handle the fast motion challenge. The DCF tracker ECO (Danelljan et al., 2017a) only achieves a success score of 23.3% on the LaSOT dataset for fast motion, which illustrates that traditional trackers are more likely to lose the targets when they move fast. One approach to alleviate such problems is to enlarge the search region, as in DaSiamRPN (Zhu et al., 2018a) and Ocean (Zhang et al., 2020). However, enlarging the search region may bring more distractors around the target. In this case, we need to carefully select hyper-parameters such as the window influence in SiamRPN (Li et al., 2018c), because a large window influence would suppress the real target location when it is far from the center of the response map.
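As a hedged illustration of this trade-off, the snippet below blends a raw score map with a cosine (Hanning) window. The array sizes, scores, and the penalized_response helper are hypothetical, but the blending convention follows the window-influence scheme described above: a larger window_influence increasingly suppresses peaks far from the center of the search region, which stabilizes tracking but can also hide a genuinely fast-moving target.

```python
import numpy as np

def penalized_response(score_map, window_influence=0.40):
    """Blend a score map with a cosine window and return the peak location."""
    h, w = score_map.shape
    cosine = np.outer(np.hanning(h), np.hanning(w))   # peaks at the map center
    blended = (1 - window_influence) * score_map + window_influence * cosine
    row, col = np.unravel_index(np.argmax(blended), blended.shape)
    return int(row), int(col)

# A strong distractor near the border wins without the window, while a larger
# window_influence pulls the estimate back toward the previous location.
score = np.zeros((17, 17))
score[8, 8] = 0.6     # true target near the center of the search region
score[1, 15] = 0.9    # stronger distractor near the border
print(penalized_response(score, window_influence=0.0))   # (1, 15)
print(penalized_response(score, window_influence=0.4))   # (8, 8)
```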
9.9. Rotation

The rotation attribute includes in-plane rotation and out-of-plane rotation. In-plane rotation means that a target rotates within the image plane, as in the MotorRolling and Skiing sequences shown in Fig. 30. Out-of-plane rotation means that the target rotates out of the image plane, as in the Skating1 sequence in Fig. 30. As shown in the success plots and precision plots of each attribute in Figs. 22 and 23, the accuracy of most trackers for out-of-plane rotation is lower than their results for in-plane rotation. For example, the top three trackers (e.g., STMTrack, SiamGAT, and SiamBAN) achieve success scores of 70.7%, 70.7%, and 68.7% in Fig. 22, and they achieve 2.3%, 3.2%, and 0.7% improvements for the attribute of in-plane rotation. The reason is that the appearance changes of a target undergoing out-of-plane rotation are more severe than those caused by in-plane rotation. In addition, deep trackers exhibit significant superiority over the traditional trackers without deep features. To enhance the network capacity, data augmentations such as flip, rotation, shift, blur, and scale are adopted for offline training in most deep trackers.

9.10. Low resolution

Low resolution means that the number of pixels inside the target box is very small (e.g., fewer than 400 pixels in OTB-100), so it is difficult to develop a discriminative appearance model for traditional trackers. Although it employs multiple features and an effective model update strategy, the ECO (Danelljan et al., 2017a) tracker only obtains a success score of 61.7% on OTB-100 for the attribute of low resolution. The DiMP50 (Bhat et al., 2019) tracker obtains a success score of 59.5% and a precision score of 81.4% even with its target model prediction module, showing poor localization ability. This is in part due to the loss of information in deep neural networks under low resolution. The top trackers such as SiamR-CNN, SiamCAR, and SA-SIAM employ a feature pyramid or multiple layers of CNNs, and outperform TrDiMP and DiMP50 by a large margin on the OTB-100 dataset, as shown in the success plots of the low-resolution attribute in Fig. 22. However, DiMP50 (Bhat et al., 2019) still shows a high success score on LaSOT for the attribute of low resolution, compared to SiamBAN (Chen et al., 2020) and SiamCAR (Guo et al., 2020). A potential reason is that DiMP50 contains an effective model update strategy when tracking long-term video sequences.

9.11. Model update

To adapt to the target appearance changes, conventional DCF trackers (e.g., MOSSE, Bolme et al., 2010, DSST, Danelljan et al., 2017b and SRDCF, Danelljan et al., 2015a) usually update the model with an online rule (Danelljan et al., 2014). For example, CCOT (Danelljan et al., 2016) learned the filters iteratively when a new frame comes. However, the continuous model updating strategy is expensive, and the model is sensitive to sudden appearance changes. Recently, ECO (Danelljan et al., 2017a), MDNet (Nam and Han, 2016), and MEEM (Zhang et al., 2014c) utilized a sparse update scheme to improve training efficiency and reduce the computational load. Specifically, MDNet (Nam and Han, 2016) exploited two different update intervals, including long-term and short-term updates. Because the frequency of model update affects both the tracking speed and the expressiveness of the model, most of the existing online update methods employ an empirical criterion. The experimental results of ECO (Danelljan et al., 2017a) further show that infrequent model updates can improve tracking performance. DeepSTRCF (Li et al., 2018a) employed an online passive–aggressive (PA) algorithm to learn the filters, which improves the robustness of the tracker. In addition to the continuous or fixed-interval update strategies, a new criterion, the peak-versus-noise ratio (PNR), was introduced by Zhu et al. (2018b) to control the model update at the right time. DiMP50 (Bhat et al., 2019) maintains a number of history samples for learning the prediction model, which is similar to the correlation filter learning process in ECO (Danelljan et al., 2017a), although this strategy decreases the tracking speed compared with other faster Siamese trackers. An alternative way is to learn an update strategy automatically by leveraging the merits of deep neural networks. Zhang et al. (2019a) proposed UpdateNet to learn an adaptive update strategy, instead of the running average strategy with exponentially decaying weights over time. In GradNet (Li et al., 2019a), an optimization-based meta-learner was used to update the template as a non-linear process, in which the discriminative information of gradients is computed to update the target feature. To alleviate the limitation of no online learning in SiamRPN (Li et al., 2018c) and the linear template update strategy in DaSiamRPN (Zhu et al., 2018a), ATOM (Danelljan et al., 2019) learned the classification model online to enhance the discriminative power of the classifier. In Han et al. (2021), the templates from the first frame and frame 𝑡−1 are fused by multi-head attention to produce dynamic part representations, guided by a target mask that is generated from the ground truth bounding box.
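For reference, the running-average rule mentioned above amounts to a single line of code. The sketch below is a toy example of ours (not taken from any cited tracker) showing that every past frame contributes to the model with an exponentially decaying weight.

```python
import numpy as np

def linear_update(model, new_model, learning_rate=0.02):
    """Running-average update used by many conventional DCF trackers.

    After k frames the contribution of an old sample is weighted by
    roughly (1 - learning_rate) ** k, so it fades out gradually.
    """
    return (1.0 - learning_rate) * model + learning_rate * new_model

# Toy example: a filter drifting toward a new appearance over 100 frames.
model = np.zeros(5)
new_appearance = np.ones(5)
for _ in range(100):
    model = linear_update(model, new_appearance)
print(model.round(3))   # about 0.867 everywhere: old information still lingers
```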
Both Dai et al. (2020) and Zhang et al. (2021) combined the local search and global search schemes for solving the long-term tracking problem. Zhang et al. (2021) applied an online learned verification network (Jung et al., 2018) to identify the target from the candidates detected by a regression network (RPN), while Dai et al. (2020) employed an RT-MDNet based verifier to further evaluate the results from the online-updated local tracker (e.g., ATOM and ECO). The confidence of the verifier controls whether the tracker switches to global search or continues to conduct local tracking in the next frame.
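The switching logic itself can be summarized in a few lines of control flow. The sketch below is a minimal, hypothetical illustration of the verifier-gated scheme described above; local_tracker, global_detector, verifier, and the confidence threshold are placeholder callables and values of ours, not the interfaces of the cited trackers.

```python
def track_sequence(frames, local_tracker, global_detector, verifier,
                   conf_thresh=0.5):
    """Verifier-gated switching between local tracking and global re-detection."""
    boxes = []
    for frame in frames:
        # cheap local search around the previous location
        box = local_tracker(frame)
        if verifier(frame, box) < conf_thresh:
            # Low confidence: the target is probably lost or occluded,
            # so fall back to a search over the whole frame.
            box = global_detector(frame)
        boxes.append(box)
    return boxes
```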
9.12. Motion model

To handle partial occlusion, a motion model aims to describe the temporal correlation of the target states between consecutive frames. The motion models include the Gaussian model, sliding window, radius sliding window, cosine window, optical flow, temporal features, the Region Proposal Network (RPN), the Recurrent Neural Network (RNN), and the action-decision process. In SINT+ (Tao et al., 2016), optical flow is used to filter out the motion-inconsistent candidates. Most tracking-by-detection trackers use the sliding window approach to generate candidate samples. Struck (Hare et al., 2016) generates the samples within a radius. The cosine window is a modification of the simple sliding window strategy; as a kind of motion model, it puts more emphasis near the center of the target. In addition, it can suppress background regions and alleviate the boundary discontinuity problem. In the particle filter framework, the motion model is implemented as the transition model 𝑝(𝑥𝑡|𝑥𝑡−1), where 𝑥𝑡 and 𝑥𝑡−1 denote the target states at frame 𝑡 and frame 𝑡−1. Generally, the transition model is implemented as a Gaussian distribution 𝑝(𝑥𝑡|𝑥𝑡−1) = 𝒩(𝑥𝑡; 𝑥𝑡−1, 𝛴), where 𝛴 is a diagonal covariance matrix whose elements are usually the corresponding variances of the affine parameters. Recently, MemTrack (Yang and Chan, 2018) and STMTrack (Fu et al., 2021) maintain a memory unit that contains multiple features of historical frames to adapt to target variations. The ConvGRU (Bhat et al., 2020) is another approach to capture the correspondence between two adjacent frames, which can be used to propagate the target states in consecutive frames. For the action-decision process, the trackers mainly learn to make decisions to search the target state with reinforcement learning. In Table 8, we present a detailed categorization of state-of-the-art trackers according to their motion models.
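The Gaussian transition model can be written down directly. The following sketch is a generic illustration assuming a simplified state (center position plus log-scale width and height) and arbitrary variances chosen by us; a real particle-filter tracker would use the affine parameters and variances of its own state representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate_particles(particles, sigma=(4.0, 4.0, 0.02, 0.02)):
    """Sample x_t ~ N(x_{t-1}, Sigma) with a diagonal covariance Sigma.

    Each row of `particles` is a state (cx, cy, log_w, log_h); `sigma`
    holds the per-dimension standard deviations.
    """
    noise = rng.normal(0.0, sigma, size=particles.shape)
    return particles + noise

# 200 particles spread around the previous state; an appearance model
# (not shown) would then re-weight and resample them.
prev_state = np.array([320.0, 240.0, np.log(80.0), np.log(60.0)])
particles = np.tile(prev_state, (200, 1))
particles = propagate_particles(particles)
print(particles.mean(axis=0).round(2))   # stays close to the previous state
```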
9.13. Hyper-parameters tuning

In most trackers, there is a set of hyper-parameters that needs to be tuned for different datasets, because the distributions of these tracking datasets differ. These hyper-parameters usually have significant impacts on tracking performance.

In SiamFC (Bertinetto et al., 2016c), the important hyper-parameters are numScale, scale-step, and scale-LR. A common approach for finding optimal hyper-parameters is random search with a uniform distribution over a reasonable range for each parameter. The more parameters there are, the more combinations we need to evaluate on the test sets.

Table 8
Trackers categorized by different motion models.
Motion model Trackers
Sliding window Struck (Hare et al., 2016)
Uniform sampling DCFST (Zheng et al., 2020)
Gaussian model CPF (Zhang et al., 2018c), MCPF (Zhang et al., 2017c), BranchOut (Han et al.,
2017), MMST (Chen et al., 2017b), MDNet (Nam and Han, 2016), FCNT (Wang
et al., 2015), SST (Zhang et al., 2015), ADNet (Yun et al., 2017), NR-MVDLSR (Kang
et al., 2019), GFS-DCF (Xu et al., 2019), RT-MDNet (Jung et al., 2018), ACT (Chen
et al., 2018), VITAL (Song et al., 2018), DLRT (Sui et al., 2018)
Cosine window MOSSE (Bolme et al., 2010), DSST (Danelljan et al., 2014), MemTrack (Yang and
Chan, 2018), SiamRN (Cheng et al., 2021), SiamGAT (Guo et al., 2021), PrDiMP50
(Danelljan et al., 2020), SiamRPN (Li et al., 2018c), SiamRPN++ (Li et al., 2019b),
DiMP50 (Bhat et al., 2019), MLT (Choi et al., 2019), SPM-Tracker (Wang et al.,
2019a), Retina/FCOS-MAML (Wang et al., 2020), SiamBAN (Chen et al., 2020),
CGACD (Du et al., 2020), BGDT (Huang et al., 2019), SACF (Zhang et al., 2018f),
SA-SIAM (He et al., 2018), CFCF (Gundogdu and Alatan, 2018), TRACA (Choi et al.,
2018), Ocean (Zhang et al., 2020), ASRCF (Dai et al., 2019), PAC (Zhang et al.,
2018a), SiamTri (Dong and Shen, 2018), CFNet (Valmadre et al., 2017), DRT (Sun
et al., 2018a), ACFT (Ma et al., 2018), StructSiam (Zhang et al., 2018e), CREST
(Song et al., 2017), CSR-DCF (Lukežič et al., 2017), EAST (Huang et al., 2017a),
DSiam (Guo et al., 2017), SiamFC (Bertinetto et al., 2016c), BACF (Galoogahi et al.,
2017b), CCOT (Danelljan et al., 2016), ECO (Danelljan et al., 2017a), SCT (Choi
et al., 2016), Staple (Bertinetto et al., 2016b), HCFT (Ma et al., 2015a), KCF
(Henriques et al., 2015), LCT (Ma et al., 2015c), SRDCF (Danelljan et al., 2015b)
Temporal features STMTrack (Fu et al., 2021), MemTrack (Yang and Chan, 2018), MMST (Chen et al.,
2017b), MUSTer (Hong et al., 2015), STNCF (Zhang et al., 2018b), UpdateNet
(Zhang et al., 2019a)
RNN HART (Kosiorek et al., 2017), KYS (Bhat et al., 2020)
Flow information FlowTrack (Zhu et al., 2018b), SINT+ (Tao et al., 2016)
Action-decision process ADNet (Yun et al., 2017), ACT (Chen et al., 2018), DRL-IS (Ren et al., 2018)

In DCF trackers such as BACF (Galoogahi et al., 2017b) and ECO (Danelljan et al., 2017a), more hyper-parameters need to be tuned than in deep trackers, such as the search area scale, the number of scales, the scale step, the learning rate for the online update, the layers of the deep neural network used for feature representation, the parameters of the optimizer (e.g., regularization factor, iterations), the spatial bandwidth of the 2D Gaussian function, the parameters of the window function, and so on. For most DCF trackers with a linear model update scheme, the learning rate has a significant impact on their tracking performance.

In SiamRPN (Li et al., 2018c) and SiamRPN++ (Li et al., 2019b), there are three hyper-parameters, the learning rate (lr), window influence, and penalty_k, all in the range of [0, 1]. The learning rate is used as a damping factor for the location and scale update. Window_influence is applied to the window function to penalize large displacements. Penalty_k is used to suppress large changes in size and aspect ratio. Apart from the above three hyper-parameters in SiamRPN (Li et al., 2018c), the Ocean (Zhang et al., 2020) tracker also contains a weight 𝜔 to combine the outputs of two different classification networks.
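To make the roles of these three hyper-parameters concrete, the following schematic sketch (our own simplified code with hypothetical array names, not the released SiamRPN implementation) applies a size and ratio penalty controlled by penalty_k, blends the score map with a cosine window weighted by window_influence, and uses lr as a damping factor for the size update.

```python
import numpy as np

def select_and_smooth(scores, pred_sizes, prev_size, cosine_window,
                      penalty_k=0.05, window_influence=0.42, lr=0.30):
    """Schematic interaction of penalty_k, window_influence, and lr.

    scores:        (N,) classification scores of the candidate boxes
    pred_sizes:    (N, 2) predicted (w, h) of each candidate
    prev_size:     (2,) previous (w, h)
    cosine_window: (N,) cosine-window weight of each candidate position
    """
    # penalty_k suppresses candidates whose size or aspect ratio change too much
    change = np.maximum(pred_sizes / prev_size, prev_size / pred_sizes).prod(axis=1)
    penalty = np.exp(-(change - 1.0) * penalty_k)
    pscore = penalty * scores

    # window_influence penalizes large displacements from the previous location
    pscore = pscore * (1 - window_influence) + cosine_window * window_influence
    best = int(np.argmax(pscore))

    # lr damps the size update so the box shape changes only gradually
    damp = penalty[best] * scores[best] * lr
    new_size = prev_size * (1 - damp) + pred_sizes[best] * damp
    return best, new_size
```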
There are toolkits that can be used to tune the hyper-parameters, such as Optuna (Akiba et al., 2019), which was used in Ocean (Zhang et al., 2020). Optuna helps researchers construct the parameter search space dynamically and provides pruning strategies. In addition to employing Optuna to find the optimal hyper-parameters automatically, an alternative way is to find them in two steps. Firstly, we construct a large search space with a fixed step size for each parameter; after evaluating the trackers in this large search space, we can identify a suboptimal but smaller search space. Secondly, we utilize Optuna to search for the optimal parameters within the reduced space.
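As a sketch of how such automatic tuning looks in practice, one maximizes a validation score over the hyper-parameter space. The search ranges and the evaluate_tracker function below are placeholders of ours; only the Optuna calls follow the library's public API. The coarse grid search described above can simply be used to narrow the ranges passed to suggest_float before this step.

```python
import optuna

def evaluate_tracker(params):
    """Placeholder: run the tracker on a validation set and return, e.g.,
    the success (AUC) score. A synthetic function is used here so the
    sketch is self-contained."""
    return -(params["window_influence"] - 0.42) ** 2 \
           - (params["penalty_k"] - 0.05) ** 2

def objective(trial):
    params = {
        "window_influence": trial.suggest_float("window_influence", 0.1, 0.6),
        "penalty_k": trial.suggest_float("penalty_k", 0.0, 0.3),
        "lr": trial.suggest_float("lr", 0.1, 0.9),
    }
    return evaluate_tracker(params)

study = optuna.create_study(direction="maximize")   # maximize the success score
study.optimize(objective, n_trials=100)
print(study.best_params)
```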
There is only one hyper-parameter (window influence) in the TransT (Chen et al., 2021) tracker; the output of its detector is used directly as the tracking result, and the scale and displacement do not need to be penalized. It seems that the fewer hyper-parameters a tracker contains, the easier it is to find the optimal model for a test dataset.

9.14. Speed

Tracking speed is an important metric for evaluating trackers, especially for the requirement of real-time processing in practical applications. The key factors that affect the tracking speed include the feature extraction, the strategy of the model update, and the hardware on which the algorithms run. The early DCF trackers with hand-crafted features, such as MOSSE (Bolme et al., 2010) and KCF (Henriques et al., 2015), achieved comparably high speed, while their accuracy and robustness still have a notable gap with state-of-the-art trackers. By combining deep and hand-crafted features, the accuracy and robustness have been improved significantly, according to the results in Fig. 21 and Table 4. However, the speed decreases significantly if deep features are incorporated into conventional trackers such as ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016), in which the filters are learned by solving an optimization problem in a high-dimensional feature space.

Thanks to the breakthroughs in accuracy and speed brought by end-to-end deep learning architectures, such as SiamFC (Bertinetto et al., 2016c), EAST (Huang et al., 2017a), CFNet (Valmadre et al., 2017), FlowTrack (Zhu et al., 2018b), SA-SIAM (He et al., 2018), SiamRPN (Li et al., 2018c), DaSiamRPN (Zhu et al., 2018a), SiamRPN++ (Li et al., 2019b), and SiamBAN (Chen et al., 2020), most Siamese network based trackers can take full advantage of the GPU and achieve high-speed performance. In order to dynamically select appropriate feature layers, EAST (Huang et al., 2017a) speeds up tracking by learning an agent that decides whether to forward to the next layer during target locating, achieving 158.9 fps on a GPU. Recently, Siamese trackers including SiamRPN (Li et al., 2018c) and DaSiamRPN (Zhu et al., 2018a) improve the tracking speed significantly due to their efficient detection module, the RPN (Region Proposal Network); DaSiamRPN (Zhu et al., 2018a) can reach 160 fps with comparable accuracy and robustness. Nevertheless, there is a tradeoff between employing complex neural networks and maintaining a fast tracking speed when designing a task-specific tracking model.

10. Future directions and open issues

Visual object tracking has been promoted by a variety of aspects, including large-scale tracking datasets, high-capacity backbone networks, training methods, model update schemes, object detection technologies, and so on. In this section, we further present the future research directions and open issues for visual object tracking.

• Lightweight Model Exploring

The backbone networks are important for the representation of a target object. In visual object tracking, we have witnessed many kinds of neural networks designed for image recognition tasks, including AlexNet, VGGNet-M, VGGNet-16, VGGNet-19, GoogLeNet, ResNet-18, ResNet-50, and their variations. On the one hand, deeper and wider neural networks can enhance the model performance. On the other hand, they bring more computational workload and a large memory footprint. For real-world applications such as mobile devices and industrial robotics, where the model size and compute resources are constrained, we need to construct lightweight neural networks with fewer parameters and FLOPs while maintaining high tracking performance. Neural Architecture Search (NAS) (Yan et al., 2021) has shown its great appeal and advantages in deep learning, providing an effective way to search for an optimal lightweight model by constraining the parameters and FLOPs. Therefore, it will be another promising research direction in visual object tracking.

• Model Update

To adapt to appearance changes of the target object, many matching based trackers incrementally update their templates. Most CFs based trackers linearly update their model at each frame. For deep trackers, meta learning can be applied to update the model with samples collected from current and previous frames on-the-fly. Dai et al. (2020) learn a Meta-Updater with a cascaded LSTM that takes geometric, discriminative, and appearance cues as input; this Meta-Updater determines whether the local tracker or the verifier should be updated based on the current tracking state. In addition to most single template-based Siamese trackers, the space–time memory network (Fu et al., 2021), which consists of multiple historical frames, provides rich appearance information of the target for developing the target model. Therefore, employing temporal features to update the target model online could be another attractive research direction in long-sequence video tracking.

• Combining the Local and Global Search Strategy

Transformer approaches have recently become popular in many kinds of vision tasks, such as image recognition, object detection, and segmentation. Several attempts have been made to employ the Transformer inside or outside the backbone network to improve the performance; please refer to Section 5.1.5. Meanwhile, pure attention models such as SASA (Ramachandran et al., 2019), BoTNet (Srinivas et al., 2021), ViT (Sharir et al., 2021), and the Pyramid Vision Transformer (PVT) (Wang et al., 2021a) have not yet been explored in combination with Transformers for visual tracking. Having the advantage of learning long-range dependencies, the self-attention (applied in one input branch) and cross-attention (between the template and search branches) enable the information to interact in a global manner, which is different from the convolution/correlation operation. Such a scheme can improve the robustness of the tracker in dealing with challenging scenarios such as partial occlusion or large-scale deformation. Therefore, combining Transformers with the Siamese network will be the next evolution to facilitate the tracking research.

On the one hand, the Transformer can be utilized to fuse template features (either from a single or multiple history frames) and search features; on the other hand, how to combine local search in convolution with global search in the Transformer is still an open issue.

• Tracking with Multi-modal Information

Existing tracking algorithms mainly search the target with vision information. In other words, they mainly localize the target by developing a model to estimate the similarity between the target and the search region, while the target model does not establish the relationship between the target and its scene. Because there are similar objects in the scene, even a deep tracker is easy to confuse and can lose the target. In KYS (Bhat et al., 2020), scene information was represented as dense localized state vectors that are propagated during the sequence; the vectors provide additional information to the appearance model for tracking. Besides the images and the ground truth bounding boxes, the LaSOT dataset also provides an additional language specification for every sequence. Even a short description can provide very useful information about the target and its scene, including the color, behavior, and surroundings of the target in the whole sequence. Exploring such language information combined with the visual information in tracking is still an open issue. The semantic information from the language specification can help the tracker to locate the target in complex scenes.

11. Conclusions

In this paper, we presented a survey of traditional and deep methods for visual object tracking. We provided both the quantitative and qualitative tracking results on multiple benchmarks. We analyzed the advantages and disadvantages of state-of-the-art single object tracking algorithms in detail and made a comprehensive analysis of their performance on five tracking datasets. The generative trackers have advantages for handling challenging scenarios such as occlusion and large-scale variation via the particle sampling strategy, and they can be combined with different appearance models (e.g., sparse representation, subspace representation, and motion energy). With the tracking-by-detection scheme, many trackers attempted to build powerful discriminative classifiers with hand-crafted or deep features. The analysis from the previous sections indicates that an appropriate motion model (summarized in Table 8) helps to improve the capacity of both the generative and discriminative trackers.

The experimental results illustrated that recently proposed deep trackers achieved superior performance on public tracking datasets in terms of accuracy, robustness, and speed due to the powerful feature extractor, accurate bounding box regressor, discriminative classifier, and fully-convolutional networks, etc. In addition to the conventional ways of convolution or correlation (local matching) for information interaction in deep Siamese network trackers, deformable convolution or the Transformer extends feature matching to a global manner. In practice, these two schemes can work together in a complementary way to further improve the capacity of the tracker, and how to combine the two kinds of operations in tracking is still an open issue for further study.

Besides the detailed overview of the literature on state-of-the-art trackers, we gave a summary of these trackers with different attributes in Table 7 to provide another comparison. We analyzed the different tracking scenarios such as occlusion, deformation, scale variation, model update, and hyper-parameters tuning in detail to further understand the characteristics of the trackers. At the end of this survey, we gave suggestions for open issues and listed potential future directions in visual object tracking.

CRediT authorship contribution statement

Fei Chen: Conceptualization, Methodology, Formal analysis, Validation, Writing – original draft, Visualization. Xiaodong Wang: Funding acquisition, Methodology, Resources, Supervision. Yunxiang Zhao: Writing – review & editing, Visualization. Shaohe Lv: Conceptualization, Investigation. Xin Niu: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments Chen, Z., You, X., Zhong, B., Li, J., Tao, D., 2017b. Dynamically modulated mask sparse
tracking. IEEE Trans. Cybern. 47, 3706–3718.
Chen, Z., Zhong, B., Li, G., Zhang, S., Ji, R., 2020. Siamese box adaptive network for
This work was supported by the Science and Technology Foundation
visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
of State Key Laboratory of Parallel and Distributed Processing, China pp. 6668–6677.
under Grant 6142110180405. Cheng, S., Zhong, B., Li, G., Liu, X., Tang, Z., Li, X., Wang, J., 2021. Learning to filter:
Siamese relation network for robust tracking. arXiv:2104.00829.
Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y.,
References
2018. Context-aware deep feature compression for high-speed visual tracking. In:
IEEE Conference on Computer Vision and Pattern Recognition. pp. 479–488.
Adelson, E.H., Bergen, J.R., 1985. Spatiotemporal energy models for the perception of Choi, J., Chang, H.J., Yun, S., Fischer, T., Demiris, Y., Choi, J.Y., et al., 2017. Atten-
motion. J. Opt. Soc. Am. A 2, 284–299. tional correlation filter network for adaptive visual tracking. In: IEEE Conference
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M., 2019. Optuna: A next-generation on Computer Vision and Pattern Recognition, Vol. 2, p. 7.
hyperparameter optimization framework. In: Proceedings of the 25rd ACM SIGKDD Choi, J., Jin Chang, H., Jeong, J., Demiris, Y., Young Choi, J., 2016. Visual tracking
International Conference on Knowledge Discovery and Data Mining. pp. 2623–2631. using attention-modulated disintegration and integration. In: IEEE Conference on
Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T., 2002. A tutorial on particle Computer Vision and Pattern Recognition. pp. 4321–4330.
filters for online nonlinear/non-gaussian Bayesian tracking. IEEE Trans. Signal Choi, J., Kwon, J., Lee, K.M., 2019. Deep meta learning for real-time target-aware visual
Process. 50, 174–188. tracking. In: IEEE International Conference on Computer Vision. pp. 911–920.
Avidan, S., 2007. Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell. 29, Corbetta, M., Shulman, G.L., 2002. Control of goal-directed and stimulus-driven
261–271. attention in the brain. Nat. Rev. Neurosci. 3, 201–215.
Avidan, S., Levi, D., Barhillel, A., Oron, S., 2015. Locally orderless tracking. Int. J. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.,
Comput. Vis. 111, 213–228. 2018. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35,
Babenko, B., Yang, M.-H., Belongie, S., 2011. Robust object tracking with online 53–65.
multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1619–1632. Cui, Z., Cai, Y., Zheng, W., Xu, C., Yang, J., 2019. Spectral filter tracking. IEEE Trans.
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning Image Process. 28, 2479–2489.
to align and translate. arXiv:1409.0473. Cui, Z., Xiao, S., Feng, J., Yan, S., 2016. Recurrently target-attending tracking. In: IEEE
Baker, S., Matthews, I., 2004. Lucas-kanade 20 years on: A unifying framework. Int. J. Conference on Computer Vision and Pattern Recognition. pp. 1449–1458.
Comput. Vis. 56, 221–255. Dai, K., Wang, D., Lu, H., Sun, C., Li, J., 2019. Visual tracking via adaptive spatially-
Bertinetto, L., Henriques, J.F., Valmadre, J., Torr, P., Vedaldi, A., 2016a. Learning feed- regularized correlation filters. In: IEEE Conference on Computer Vision and Pattern
forward one-shot learners. In: Advances in Neural Information Processing Systems. Recognition. pp. 4670–4679.
pp. 523–531. Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X., 2020. High-performance long-term
Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S., 2016b. Staple: tracking with meta-updater. In: IEEE Conference on Computer Vision and Pattern
Complementary learners for real-time tracking. In: IEEE Conference on Computer Recognition. pp. 6297–6306.
Vision and Pattern Recognition. pp. 1401–1409. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M., 2017a. ECO: Efficient convolution
Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S., 2016c. Fully- operators for tracking. In: IEEE Conference OnComputer Vision and Pattern
convolutional Siamese networks for object tracking. In: European Conference on Recognition. pp. 6931–6939.
Computer Vision. pp. 850–865. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M., 2019. ATOM: Accurate tracking
Bhat, G., Danelljan, M., Gool, L.V., Timofte, R., 2020. Know your surroundings: by overlap maximization. In: IEEE Conference on Computer Vision and Pattern
Exploiting scene information for object tracking. In: European Conference on Recognition. pp. 4660–4669.
Computer Vision. pp. 205–221. Danelljan, M., Häger, G., Khan, F., Felsberg, M., 2014. Accurate scale estimation
Bhat, G., Danelljan, M., Van Gool, L., Timofte, R., 2019. Learning discriminative model for robust visual tracking. In: British Machine Vision Conference, Nottingham,
prediction for tracking. In: IEEE International Conference on Computer Vision. pp. September 1-5, 2014. BMVA Press.
6182–6191. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M., 2015a. Convolutional features
Bhat, G., Johnander, J., Danelljan, M., Shahbaz Khan, F., Felsberg, M., 2018. Unveiling for correlation filter based visual tracking. In: IEEE International Conference on
the power of deep tracking. In: European Conference on Computer Vision. pp. Computer Vision Workshop. pp. 621–629.
483–498. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M., 2017b. Discriminative scale space
Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M., 2010. Visual object tracking using tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1561–1575.
adaptive correlation filters. In: IEEE Conference on Computer Vision and Pattern Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015b. Learning spatially
Recognition. pp. 2544–2550. regularized correlation filters for visual tracking. In: IEEE International Conference
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed optimization on Computer Vision. pp. 4310–4318.
and statistical learning via the alternating direction method of multipliers. Found. Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M., 2016. Beyond correlation
Trends Mach. Learn. 3, 1–122. filters: Learning continuous convolution operators for visual tracking. In: European
Briechle, K., Hanebeck, U.D., 2001. Template matching using fast normalized cross Conference on Computer Vision. pp. 472–488.
correlation. In: Optical Pattern Recognition XII, vol. 4387, pp. 95–103. Danelljan, M., Van Gool, L., Timofte, R., 2020. Probabilistic regression for visual
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R., 1994. Signature verification tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp.
using a "Siamese" time delay neural network. In: Advances in Neural Information 7183–7192.
Processing Systems. pp. 737–744. Dekel, T., Oron, S., Rubinstein, M., Avidan, S., Freeman, W.T., 2015. Best-buddies
Cai, Z., Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object similarity for robust template matching. In: 2015 IEEE Conference on Computer
detection. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. Vision and Pattern Recognition. pp. 2021–2029.
6154–6162. Dollár, P., Appel, R., Belongie, S., Perona, P., 2014. Fast feature pyramids for object
Caicedo, J.C., Lazebnik, S., 2015. Active object localization with deep reinforcement detection. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1532–1545.
learning. In: IEEE International Conference on Computer Vision. pp. 2488–2496. Dong, X., Shen, J., 2018. Triplet loss in siamese network for object tracking. In:
Cannons, K., Gryn, J.M., Wildes, R.P., 2010. Visual tracking using a pixelwise spa- European Conference on Computer Vision. pp. 459–474.
tiotemporal oriented energy representation. In: European Conference on Computer Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van
Vision. pp. 511–524. Der Smagt, P., et al., 2015. Flownet: Learning optical flow with convolutional
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End- networks. In: IEEE International Conference on Computer Vision. pp. 2758–2766.
to-End object detection with transformers. In: European Conference on Computer Doucet, A., De Freitas, N., Gordon, N., 2001. An introduction to sequential Monte Carlo
Vision. pp. 213–229. methods. In: Sequential Monte Carlo Methods in Practice. pp. 3–14.
Čehovin, L., Leonardis, A., Kristan, M., 2016. Visual object tracking performance Dredze, M., Kulesza, A., Crammer, K., 2010. Multi-domain learning by confidence-
measures revisited. IEEE Trans. Image Process. 25, 1261–1274. weighted parameter combination. Mach. Learn. 79, 123–149.
Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A., 2014. Return of the devil Du, F., Liu, P., Zhao, W., Tang, X., 2020. Correlation-guided attention for corner
in the Details: Delving deep into convolutional nets. In: British Machine Vision detection based visual tracking. In: IEEE Conference on Computer Vision and
Conference. Pattern Recognition. pp. 6836–6845.
Chen, B., Li, P., Sun, C., et al., 2019. Multi attention module for visual tracking. Pattern Duan, L., Tsang, I.W., Xu, D., Chua, T.-S., 2009. Domain adaptation from multiple
Recognit. 87, 80–93. sources via auxiliary classifiers. In: International Conference on Machine Learning.
Chen, Z., Luo, L., Huang, D., Wen, M., Zhang, C., 2017a. Exploiting a depth context pp. 289–296.
model in visual tracking with correlation filter. Front. Inf. Technol. Electron. Eng. Fan, H., Ling, H., 2017a. Parallel tracking and verifying: A framework for real-time
18, 667–679. and high accuracy visual tracking. In: IEEE International Conference on Computer
Chen, B., Wang, D., Li, P., Wang, S., Lu, H., 2018. Real-time’Actor-Critic’Tracking. In: Vision. pp. 5486–5494.
European Conference on Computer Vision. pp. 318–334. Fan, H., Ling, H., 2017b. SANet: Structure-aware network for visual tracking. In:
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H., 2021. Transformer tracking. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp.
IEEE Conference on Computer Vision and Pattern Recognition. pp. 8126–8135. 2217–2224.


Fan, H., Ling, H., 2019. Siamese cascaded region proposal networks for real-time visual Huang, R., Zhang, S., Li, T., He, R., 2017. Beyond face rotation: Global and local
tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. perception GAN for photorealistic and identity preserving frontal view synthesis.
7952–7961. In: IEEE International Conference on Computer Vision. pp. 2439–2448.
Fan, H., Ling, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Huang, L., Zhao, X., Huang, K., 2018. GOT-10k: A large high-diversity benchmark for
0000. LaSOT Evaluation Toolkit, https://github.com/HengLan/LaSOT_Evaluation_ generic object tracking in the wild. arXiv:1810.11981.
Toolkit. Huang, L., Zhao, X., Huang, K., 2019. Bridging the gap between detection and tracking:
Fan, H., Ling, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., A unified approach. In: IEEE International Conference on Computer Vision. pp.
2019. LaSOT: A high-quality benchmark for large-scale single object tracking. In: 3999–4009.
IEEE Conference on Computer Vision and Pattern Recognition. pp. 5374–5383. Isard, M., Blake, A., 1998. Condensation—Conditional density propagation for visual
Fan, J., Song, H., Zhang, K., Liu, Q., Lian, W., 2018. Complementary tracking via dual tracking. Int. J. Comput. Vis. 29, 5–28.
color clustering and spatio-temporal regularized correlation learning. IEEE Access Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with
6, 56526–56538. conditional adversarial networks. In: IEEE Conference on Computer Vision and
Fiaz, M., Mahmood, A., Javed, S., Jung, S.K., 2019. Handcrafted and deep trackers: Pattern Recognition. pp. 5967–5976.
Recent visual object tracking approaches and trends. ACM Comput. Surv. 52, 1–44. Jaderberg, M., Simonyan, K., Zisserman, A., et al., 2015. Spatial transformer networks.
Finn, C., Abbeel, P., Levine, S., 2017. Model-agnostic meta-learning for fast adaptation In: Advances in Neural Information Processing Systems. pp. 2017–2025.
of deep networks. In: IEEE International Conference on Machine Learning. pp. Ji, H., Ling, H., Wu, Y., Bao, C., 2012. Real time robust L1 tracker using accelerated
1126–1135. proximal gradient approach. In: IEEE Conference on Computer Vision and Pattern
Fu, Z., Liu, Q., Fu, Z., Wang, Y., 2021. Stmtrack: Template-free visual tracking with Recognition. pp. 1830–1837.
space-time memory networks. arXiv:2104.00324. Jia, X., Lu, H., Yang, M.-H., 2012. Visual tracking via adaptive structural local
Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S., 2017a. Need for speed: A sparse appearance model. In: IEEE Conference on Computer Vision and Pattern
benchmark for higher frame rate object tracking. In: IEEE International Conference Recognition. pp. 1822–1829.
on Computer Vision. pp. 1134–1143. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y., 2018. Acquisition of localization
Galoogahi, H.K., Fagg, A., Lucey, S., 2017b. Learning background-aware correlation confidence for accurate object detection. In: European Conference on Computer
filters for visual tracking. In: IEEE Conference on Computer Vision and Pattern Vision. Springer International Publishing, pp. 816–832.
Recognition. pp. 21–26. Jung, I., Son, J., Baek, M., Han, B., 2018. Real-time MDNet. In: European Conference
Gavves, E., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P., Tao, R., Valmadre, J., on Computer Vision. pp. 83–98.
2018. Long-term tracking in the wild: A benchmark. In: European Conference on Kalal, Z., Mikolajczyk, K., et al., 2011. Tracking-learning-detection. IEEE Trans. Pattern
Computer Vision. pp. 670–685. Anal. Mach. Intell. 34, 1409–1422.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for Kang, B., Zhu, W.-P., Liang, D., Chen, M., 2019. Robust visual tracking via nonlocal
accurate object detection and semantic segmentation. In: IEEE Conference on regularized multi-view sparse representation. Pattern Recognit. 88, 75–89.
Computer Vision and Pattern Recognition. pp. 580–587. Khan, Z., Balch, T., Dellaert, F., 2004. A rao-blackwellized particle filter for
Gundogdu, E., Alatan, A.A., 2018. Good features to correlate for visual tracking. IEEE Eigentracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Trans. Image Process. 27, 2526–2540. Kiani Galoogahi, H., Sim, T., Lucey, S., 2013. Multi-channel correlation filters. In: IEEE
International Conference on Computer Vision. pp. 3072–3079.
Guo, Q., Feng, W., Zhou, C., Huang, R., Wan, L., Wang, S., 2017. Learning dynamic
Kiani Galoogahi, H., Sim, T., Lucey, S., 2015. Correlation filters with limited boundaries.
siamese network for visual object tracking. In: IEEE International Conference on
In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4630–4638.
Computer Vision. pp. 1781–1789.
Kingma, D.P., Ba, J., 2021. An image is worth 16x16 words, What is a video
Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., Shen, C., 2021. Graph attention tracking.
worth? arXiv:1412.6980.
arXiv:2011.11204.
Konda, V.R., Tsitsiklis, J.N., 2000. Actor-critic algorithms. In: Advances in Neural
Guo, D., Wang, J., Cui, Y., Wang, Z., Chen, S., 2020. SiamCAR: Siamese fully
Information Processing Systems. pp. 1008–1014.
convolutional classification and regression for visual tracking. In: IEEE Conference
Kosiorek, A., Bewley, A., Posner, I., 2017. Hierarchical attentive recurrent tracking. In:
on Computer Vision and Pattern Recognition. pp. 6269–6277.
Advances in Neural Information Processing Systems. pp. 3053–3061.
Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking with parametric models of
Kristan, M., Eldesokey, A., et al., 2017. The visual object tracking VOT2017 challenge
geometry and illumination. IEEE Trans. Pattern Anal. Mach. Intell. 20, 1025–1039.
results. In: IEEE International Conference on Computer Vision Workshop. pp.
Han, W., Huang, H., Yu, X., 2021. TAPL: Dynamic part-based visual tracking via
1949–1972.
attention-guided part localization. In: British Machine Vision Conference.
Kristan, M., Leonardis, A., Matas, J., et al., 2018. The sixth visual object tracking
Han, B., Sim, J., Adam, H., 2017. Branchout: Regularization for online ensemble
VOT2018 challenge results. In: European Conference on Computer Vision.
tracking with convolutional neural networks. In: IEEE International Conference on
Kristan, M., Leonardis, A., et al., 2016. The visual object tracking VOT2016 challenge
Computer Vision. pp. 2217–2224.
results. In: European Conference on Computer Vision Workshops, Vol. 8926, pp.
Hare, S., Golodetz, S., Saffari, A., Vineet, V., Cheng, M.-M., Hicks, S.L., Torr, P.H.,
191–217.
2016. Struck: Structured output tracking with kernels. IEEE Trans. Pattern Anal.
Kristan, M., Matas, J., Leonardis, A., et al., 2015. The visual object tracking VOT2015
Mach. Intell. 38, 2096–2109.
challenge results. In: IEEE International Conference on Computer Vision Workshops.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: IEEE International
pp. 1–23.
Conference on Computer Vision. pp. 2961–2969.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep
He, A., Luo, C., Tian, X., Zeng, W., 2018. A twofold Siamese network for real-time convolutional neural networks. In: Advances in Neural Information Processing
object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. Systems. pp. 1097–1105.
pp. 4834–4843. Kwon, J., Lee, K.M., Park, F.C., 2009. Visual tracking via geometric particle filtering
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. on the affine group with optimal importance functions. In: IEEE Conference on
In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778. Computer Vision and Pattern Recognition. pp. 991–998.
Held, D., Thrun, S., Savarese, S., 2016. Learning to track at 100 Fps with deep Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., et al., 2017. Photo-
regression networks. In: European Conference on Computer Vision. pp. 749–765. realistic single image super-resolution using a generative adversarial network. In:
Henriques, J.o.F., Caseiro, R., Martins, P., Batista, J., 2012. Exploiting the circu- IEEE Conference on Computer Vision and Pattern Recognition. pp. 4681–4690.
lant structure of tracking-by-detection with kernels. In: European Conference on Li, P., Chen, B., Ouyang, W., Wang, D., Yang, X., Lu, H., 2019a. GradNet: Gradient-
Computer Vision. pp. 702–715. guided network for visual object tracking. In: IEEE International Conference on
Henriques, J.F., Caseiro, R., Martins, P., Batista, J., 2015. High-speed tracking with Computer Vision. pp. 6162–6171.
kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37, 583–596. Li, A., Lin, M., Wu, Y., Yang, M.H., Yan, S., 2016a. NUS-PRO: A new visual tracking
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, challenge. IEEE Trans. Pattern Anal. Mach. Intell. 38, 335–349.
1735–1780. Li, X., Shen, C., Dick, A., Zhang, Z.M., Zhuang, Y., 2016b. Online metric-weighted
Hong, Z., Chen, Z., Wang, C., Mei, X., Prokhorov, D., Tao, D., 2015. MUlti-store Tracker linear representations for robust visual tracking. IEEE Trans. Pattern Anal. Mach.
(MUSTer): A cognitive psychology inspired approach to object tracking. In: IEEE Intell. 38, 931–950.
Conference on Computer Vision and Pattern Recognition. pp. 749–758. Li, M., Tan, T., Chen, W., Huang, K., 2012. Efficient object tracking by incremental
Hong, Z., Mei, X., Prokhorov, D., Tao, D., 2013. Tracking via robust multi-task multi- self-tuning particle filtering on the affine group. IEEE Trans. Image Process. 21,
view joint sparse representation. In: IEEE International Conference on Computer 1298–1313.
Vision. pp. 649–656. Li, F., Tian, C., Zuo, W., Zhang, L., Yang, M.-H., 2018a. Learning spatial-temporal
Horn, B.K.P., Schunck, B.G., 1981. Determining optical flow. Artificial Intelligence 17, regularized correlation filters for visual tracking. In: IEEE Conference on Computer
185–203. Vision and Pattern Recognition. pp. 4904–4913.
Hua, Y., Alahari, K., Schmid, C., 2015. Online object tracking with proposal selection. Li, P., Wang, D., Wang, L., Lu, H., 2018b. Deep visual tracking: Review and
In: IEEE International Conference on Computer Vision. pp. 3092–3100. experimental comparison. Pattern Recognit. 76, 323–338.
Huang, C., Lucey, S., Ramanan, D., 2017. Learning policies for adaptive tracking with Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J., 2019b. Siamrpn++: Evolution of
deep feature cascades. In: IEEE International Conference on Computer Vision. pp. Siamese visual tracking with very deep networks. In: IEEE Conference on Computer
105–114. Vision and Pattern Recognition. pp. 4282–4291.


Li, B., Xie, W., Zeng, W., Liu, W., 2019c. Learning to update for object tracking with Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, U., Shazeer, N., Ku, A., Tran, D., 2018.
recurrent meta-learner. IEEE Trans. Image Process. 28, 3624–3635. Image transformer. In: International Conference on Machine Learning.
Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X., 2018c. High performance visual tracking with Qi, Y., Zhang, S., Qin, L., Yao, H., Huang, Q., Lim, J., Yang, M.-H., 2016. Hedged
Siamese region proposal network. In: IEEE Conference on Computer Vision and deep tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Pattern Recognition. pp. 8971–8980. pp. 4303–4311.
Li, Y., Zhu, J., 2014. A scale adaptive kernel correlation filter tracker with feature Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J., 2019.
integration. In: European Conference on Computer Vision. pp. 254–265. Stand-alone self-attention in vision models. arXiv:1906.05909.
Liang, P., Blasch, E., Ling, H., 2015. Encoding color information for visual tracking: Real, E., Shlens, J., Mazzocchi, S., et al., 2017. YouTube-BoundingBoxes: A large
Algorithms and benchmark. IEEE Trans. Image Process. 24, 5630–5644. high-precision human-annotated data set for object detection in Video. In: IEEE
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object Conference on Computer Vision and Pattern Recognition. pp. 5296–5305.
detection. IEEE Trans. Pattern Anal. Mach. Intell. PP, 2999–3007. Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., detection with region proposal networks. In: Advances in Neural Information
Zitnick, C.L., 2014. Microsoft Coco: Common objects in context. In: European Processing Systems. pp. 91–99.
Conference on Computer Vision. pp. 740–755. Ren, L., Yuan, X., Lu, J., Yang, M., Zhou, J., 2018. Deep reinforcement learning with
Liu, F., Gong, C., Huang, X., Zhou, T., Yang, J., Tao, D., 2018. Robust visual tracking iterative shift for visual tracking. In: European Conference on Computer Vision. pp.
revisited: From correlation filter to template matching. IEEE Trans. Image Process. 684–700.
27, 2777–2790. Ross, D.A., Lim, J., et al., 2008. Incremental learning for robust visual tracking. Int. J.
Liu, T., Wang, G., Yang, Q., 2015. Real-time part-based visual tracking via adaptive Comput. Vis. 77, 125–141.
correlation filters. In: IEEE Conference on Computer Vision and Pattern Recognition. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
pp. 4902–4912. Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Int. J. recognition challenge. Int. J. Comput. Vis. 115, 211–252.
Comput. Vis. 60, 91–110. Sharir, G., Noy, A., Zelnik-Manor, L., 2021. An image is worth 16x16 words, What is
Lu, X., Ma, C., Ni, B., Yang, X., Reid, I., Yang, M.-H., 2018. Deep regression tracking a Video worth? arXiv:2103.13915.
with shrinkage loss. In: European Conference on Computer Vision. pp. 353–369. Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M., 2014.
Lukežič, A., Vojíř, T., Zajc, L.Č., Matas, J., Kristan, M., 2017. Discriminative correlation Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell. 36,
filter with channel and spatial reliability. In: IEEE Conference on Computer Vision 1442–1468.
and Pattern Recognition. pp. 4847–4856. Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W., Yang, M.-H., 2017. Crest: Convolutional
Lukežič, A., Matas, J., Kristan, M., 2020. D3S-A discriminative single shot segmentation residual learning for visual tracking. In: IEEE International Conference on Computer
tracker. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. Vision. pp. 2574–2583.
7133–7142. Song, Y., Ma, C., Wu, X., Gong, L., Bao, L., Zuo, W., Shen, C., Lau, R.W.H., Yang, M.-
Lukežič, L.Č., Vojíř, T., Matas, J., Kristan, M., 2021. Performance evaluation H., 2018. VITAL: VIsual tracking via adversarial learning. In: IEEE Conference on
methodology for long-term single-object tracking. IEEE Trans. Cybern. 51, Computer Vision and Pattern Recognition. pp. 8990–8999.
6305–6318.
Song, H., Zheng, Y., Zhang, K., 2016. Robust visual tracking via self-similarity learning.
Lukežič, A., Zajc, L.Č., Vojíř, T., Matas, J., Kristan, M., 2018. Now you see me:
Electron. Lett. 53, 20–22.
Evaluating performance in long-term visual tracking. arXiv:1804.07056.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A., 2021. Bottleneck
Ma, C., Huang, J.-B., Yang, X., Yang, M.-H., 2015a. Hierarchical convolutional features
transformers for visual recognition. arXiv:2101.11605.
for visual tracking. In: IEEE International Conference on Computer Vision. pp.
Sui, Y., Tang, Y., Zhang, L., 2015. Discriminative low-rank tracking. In: IEEE
3074–3082.
International Conference on Computer Vision. pp. 3002–3010.
Ma, C., Huang, J.-B., Yang, X., Yang, M.-H., 2018. Adaptive correlation filters with
Sui, Y., Tang, Y., Zhang, L., Wang, G., 2018. Visual tracking via subspace learning: A
long-term and short-term memory for object tracking. Int. J. Comput. Vis. 1–26.
discriminative approach. Int. J. Comput. Vis. 126, 515–536.
Ma, L., Lu, J., Feng, J., Zhou, J., 2015b. Multiple feature fusion via weighted entropy
Sun, C., Wang, D., Lu, H., Yang, M.-H., 2018a. Correlation tracking via joint discrimina-
for visual tracking. In: IEEE International Conference on Computer Vision. pp.
tion and reliability learning. In: IEEE Conference on Computer Vision and Pattern
3128–3136.
Recognition. pp. 489–497.
Ma, C., Yang, X., Zhang, C., Yang, M.-H., 2015c. Long-term correlation tracking. In:
Sun, C., Wang, D., Lu, H., Yang, M.-H., 2018b. Learning spatial-aware regressions for
IEEE Conference on Computer Vision and Pattern Recognition. pp. 5388–5396.
visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S., 2021. Deep learning
pp. 8962–8970.
for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst..
Supančič III, J., Ramanan, D., 2017. Tracking as online decision-making: Learning a
Mei, X., Ling, H., 2009. Robust Visual Tracking Using 𝓁1 Minimization. In: IEEE
policy from streaming videos with reinforcement learning. In: IEEE International
International Conference on Computer Vision. pp. 1436–1443.
Conference on Computer Vision. pp. 322–331.
Mei, X., Ling, H., Wu, Y., Blasch, E., Bai, L., 2011. Minimum error bounded efficient
Sutton, R.S., Barto, A.G., 1998. Introduction to Reinforcement Learning, Vol. 135.
𝓁1 tracker with occlusion detection. In: IEEE Conference on Computer Vision and
Pattern Recognition. pp. 1257–1264. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y., 2000. Policy gradient methods
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., for reinforcement learning with function approximation. In: Advances in Neural
Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv:1312. Information Processing Systems. pp. 1057–1063.
5602. Tang, M., Feng, J., 2015. Multi-kernel correlation filter for visual tracking. In: IEEE
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., International Conference on Computer Vision. pp. 3038–3046.
Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Human-level Tao, R., Gavves, E., Smeulders, A.W.M., 2016. Siamese instance search for tracking. In:
control through deep reinforcement learning. Nature 518, 529. IEEE Conference on Computer Vision and Pattern Recognition. pp. 1420–1429.
Moudgil, A., Gandhi, V., 2017. Long-term visual object tracking benchmark. arXiv: Teng, Z., Xing, J., Wang, Q., Lang, C., Feng, S., Jin, Y., et al., 2017. Robust object
1712.01358. tracking based on temporal and spatial deep networks. In: IEEE International
Mueller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B., 2018. TrackingNet: A Conference on Computer Vision. pp. 1153–1162.
large-scale dataset and benchmark for object tracking in the wild. In: European Tian, Z., Shen, C., Chen, H., He, T., 2020. FCOS: Fully convolutional one-stage object
Conference on Computer Vision. detection. In: International Conference on Computer Vision. pp. 9627–9636.
Mueller, M., Smith, N., Ghanem, B., 2016. A benchmark and simulator for UAV Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., 2005. Large margin methods for
tracking. In: European Conference on Computer Vision. pp. 445–461. structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484.
Mueller, M., Smith, N., Ghanem, B., 2017. Context-aware correlation filter tracking. In: Ungerleider, L.G., Kastner, S., 2000. Mechanisms of visual attention in the human
IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, p. 6. cortex. Annu. Rev. Neurosci. 23, 315–341.
Nam, H., Han, B., 2016. Learning Multi-domain Convolutional Neural Networks for Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H., 2017. End-to-end
Visual Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. representation learning for correlation filter based tracking. In: IEEE Conference
pp. 4293–4302. on Computer Vision and Pattern Recognition. pp. 5000–5008.
Newell, A., Yang, K., Deng, J., 2016. Stacked hourglass networks for human pose Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
estimation. In: European Conference on Computer Vision. Springer, pp. 483–499. Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information
Nguyen, H.T., Smeulders, A.W.M., 2004. Fast occluded object tracking by a robust Processing Systems, vol. 30.
appearance filter. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1099–1104. Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple
Nguyen, H.T., Smeulders, A.W.M., 2006. Robust tracking using foreground-background features. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1,
texture discrimination. Int. J. Comput. Vis. 69, 277–293. p. I.
Ning, J., Yang, J., Jiang, S., Zhang, L., Yang, M.-H., 2016. Object tracking via dual Voigtlaender, P., Luiten, J., Torr, P.H.S., Leibe, B., 2020. Siam R-CNN: Visual tracking
linear structured SVM and explicit feature map. In: IEEE Conference on Computer by re-detection. In: IEEE Conference on Computer Vision and Pattern Recognition.
Vision and Pattern Recognition. pp. 4266–4274. pp. 6577–6587.
Park, E., Berg, A.C., 2018. Meta-tracker: Fast and robust online adaptation for visual Wang, Q., Gao, J., et al., 2017. DCFNet: Discriminant correlation filters network for
object trackers. In: European Conference on Computer Vision. pp. 569–585. visual tracking. arXiv:1704.04057.

Wang, X., Hou, Z., Yu, W., Pu, L., Jin, Z., Qin, X., 2018a. Robust occlusion-aware part-based visual tracking with object scale adaptation. Pattern Recognit. 81, 456–470.
Wang, X., Li, C., Yang, R., Zhang, T., Tang, J., Luo, B., 2018b. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. arXiv:1811.10014.
Wang, D., Lu, H., Yang, M.-H., 2013. Online object tracking with sparse prototypes. IEEE Trans. Image Process. 22, 314–325.
Wang, G., Luo, C., Sun, X., Xiong, Z., Zeng, W., 2020. Tracking by instance detection: A meta-learning approach. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6287–6296.
Wang, G., Luo, C., Xiong, Z., Zeng, W., 2019a. SPM-tracker: Series-parallel matching for real-time visual object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Wang, L., Ouyang, W., Wang, X., Lu, H., 2015. Visual tracking with fully convolutional networks. In: IEEE International Conference on Computer Vision. pp. 3119–3127.
Wang, L., Ouyang, W., Wang, X., Lu, H., 2016. STCT: Sequentially training convolutional networks for visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1373–1381.
Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., Li, H., 2019b. Unsupervised deep tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1308–1317.
Wang, Q., Teng, Z., Xing, J., Gao, J., et al., 2018. Learning attentions: Residual attentional Siamese network for high performance online visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4854–4863.
Wang, A., Wan, G., Cheng, Z., Li, S., 2009. An incremental extremely random forest classifier for online learning and tracking. In: IEEE International Conference on Image Processing. pp. 1449–1452.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2021a. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv:2102.12122.
Wang, N., Yeung, D.-Y., 2014. Ensemble-based tracking: Aggregating crowdsourced structured time series data. In: International Conference on Machine Learning. pp. 1107–1115.
Wang, N., Zhou, W., Wang, J., Li, H., 2021b. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1571–1580.
Wright, S., Nocedal, J., et al., 1999. Numerical optimization, vol. 35. Springer Science, p. 7.
Wu, Y., Lim, J., Yang, M.-H., 0000. Online object tracking: A benchmark, http://cvlab.hanyang.ac.kr/tracker_benchmark/benchmark_v10.html.
Wu, Y., Lim, J., Yang, M.-H., 2013. Online object tracking: A benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2411–2418.
Wu, Y., Lim, J., Yang, M.-H., 2015. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1834–1848.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y., 2015. Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. pp. 2048–2057.
Xu, T., Feng, Z.-H., Wu, X.-J., Kittler, J., 2019. Joint group feature selection and discriminative filter learning for robust visual object tracking. In: IEEE International Conference on Computer Vision. pp. 7949–7959.
Xu, Y., Wang, Z., Li, Z., Yuan, Y., Yu, G., 2020. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. arXiv:1911.06188.
Yan, B., Peng, H., Wu, K., Wang, D., Fu, J., Lu, H., 2021. LightTrack: Finding lightweight neural networks for object tracking via one-shot architecture search. arXiv:2104.14545.
Yang, T., Chan, A.B., 2017. Recurrent filter learning for visual tracking. In: IEEE International Conference on Computer Vision. pp. 2010–2019.
Yang, T., Chan, A.B., 2018. Learning dynamic memory networks for object tracking. In: European Conference on Computer Vision. pp. 152–167.
Yang, K., Song, H., Zhang, K., Fan, J., 2019a. Deeper Siamese network with multi-level feature fusion for real-time visual tracking. Electron. Lett. 55, 742–745.
Yang, K., Song, H., Zhang, K., Liu, Q., 2019b. Hierarchical attentive Siamese network for real-time visual tracking. Neural Comput. Appl. 1–12.
Yao, Y., Wu, X., Zhang, L., Shan, S., Zuo, W., 2018. Joint representation and truncated inference learning for correlation filter based tracking. In: European Conference on Computer Vision. pp. 552–567.
Yilmaz, A., Javed, O., Shah, M., 2006. Object tracking: A survey. ACM Comput. Surv. 38, 13.
Yu, Q., Dinh, T.B., Medioni, G., 2008. Online tracking and reacquisition using co-trained generative and discriminative trackers. In: European Conference on Computer Vision. Springer, pp. 678–691.
Yu, Z., Xiang, B., Liu, W., Latecki, L.J., 2016. Similarity fusion for visual tracking. Int. J. Comput. Vis. 118, 337–363.
Yu, Y., Xiong, Y., Huang, W., Scott, M.R., 2020. Deformable Siamese attention networks for visual object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6727–6736.
Yun, S., Choi, J., Yoo, Y., Yun, K., Choi, J.Y., 2017. Action-decision networks for visual tracking with deep reinforcement learning. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1349–1358.
Zhang, K., Fan, J., Liu, Q., et al., 2018a. Parallel attentive correlation tracking. IEEE Trans. Image Process. 28, 479–491.
Zhang, T., Ghanem, B., Liu, S., Ahuja, N., 2012a. Robust visual tracking via multi-task sparse learning. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2042–2049.
Zhang, L., Gonzalez-Garcia, A., van de Weijer, J., Danelljan, M., Khan, F.S., 2019a. Learning the model update for Siamese trackers. In: IEEE International Conference on Computer Vision. pp. 4010–4019.
Zhang, T., Jia, K., Xu, C., Ma, Y., Ahuja, N., 2014a. Partial occlusion handling for visual tracking via robust part matching. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1258–1265.
Zhang, S., Lan, X., Yao, H., Zhou, H., Tao, D., Li, X., 2017a. A biologically inspired appearance model for robust visual tracking. IEEE Trans. Neural Netw. Learn. Syst. 28, 2357–2370.
Zhang, K., Li, X., Song, H., Liu, Q., Wei, L., 2018b. Visual tracking using spatio-temporally nonlocally regularized correlation filter. Pattern Recognit. 83, 185–195.
Zhang, T., Liu, S., Ahuja, N., Yang, M.-H., Ghanem, B., 2014b. Robust visual tracking via consistent low-rank sparse learning. Int. J. Comput. Vis. 111, 171–190.
Zhang, K., Liu, Q., Wu, Y., Yang, M.-H., 2016. Robust visual tracking via convolutional networks without training. IEEE Trans. Image Process. 25, 1779–1792.
Zhang, T., Liu, S., Xu, C., Liu, B., Yang, M.-H., 2018c. Correlation particle filter for visual tracking. IEEE Trans. Image Process. 27, 2676–2687.
Zhang, T., Liu, S., Xu, C., Yan, S., Ghanem, B., Ahuja, N., Yang, M.-H., 2015. Structural sparse tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 150–158.
Zhang, K., Liu, Q., Yang, J., Yang, M.-H., 2018d. Visual tracking via Boolean map representations. Pattern Recognit. 81, 147–160.
Zhang, J., Ma, S., Sclaroff, S., 2014c. MEEM: Robust tracking via multiple experts using entropy minimization. In: European Conference on Computer Vision. pp. 188–203.
Zhang, Z., Peng, H., 2019. Deeper and wider Siamese networks for real-time visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4591–4600.
Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W., 2020. Ocean: Object-aware anchor-free tracking. In: European Conference on Computer Vision. pp. 771–787.
Zhang, C., Platt, J.C., Viola, P.A., 2006. Multiple instance boosting for object detection. In: Advances in Neural Information Processing Systems. pp. 1417–1424.
Zhang, L., Suganthan, P.N., 2017. Robust visual tracking via co-trained kernelized correlation filters. Pattern Recognit. 69, 82–93.
Zhang, L., Varadarajan, J., Suganthan, P.N., Ahuja, N., Moulin, P., 2017b. Robust visual tracking using oblique random forests. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5589–5598.
Zhang, Y., Wang, L., Qi, J., Wang, D., Feng, M., Lu, H., 2018e. Structured Siamese network for real-time visual tracking. In: European Conference on Computer Vision. pp. 351–366.
Zhang, Y., Wang, L., Wang, D., Qi, J., Lu, H., 2021. Learning regression and verification networks for robust long-term tracking. Int. J. Comput. Vis.
Zhang, M., Wang, Q., Xing, J., Gao, J., Peng, P., Hu, W., Maybank, S., 2018f. Visual tracking via spatially aligned correlation filters network. In: European Conference on Computer Vision. pp. 469–485.
Zhang, T., Xu, C., Yang, M.-H., 2017c. Multi-task correlation particle filter for robust object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 3.
Zhang, T., Xu, C., Yang, M.-H., 2019b. Learning multi-task correlation particle filters for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 41, 365–378.
Zhang, S., Yao, H., Sun, X., Lu, X., 2013. Sparse coding based visual tracking: Review and experimental comparison. Pattern Recognit. 46, 1772–1788.
Zhang, K., Zhang, L., Yang, M.-H., 2012b. Real-time compressive tracking. In: European Conference on Computer Vision. pp. 864–877.
Zhao, L., Zhao, Q., Chen, Y., Lv, P., 2016. Combined discriminative global and generative local models for visual tracking. J. Electron. Imaging 25, 023005.
Zheng, L., Tang, M., Chen, Y., Wang, J., Lu, H., 2020. Learning feature embeddings for discriminant model based tracking. In: European Conference on Computer Vision. pp. 759–775.
Zhong, W., Lu, H., Yang, M.-H., 2012. Robust object tracking via sparsity-based collaborative model. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1838–1845.
Zhou, H., Fei, M., Sadka, A., Zhang, Y., Li, X., 2014. Adaptive fusion of particle filtering and spatio-temporal motion energy for human tracking. Pattern Recognit. 47, 3552–3567.
Zhu, Z., Wang, Q., Li, B., Wei, W., Yan, J., 2018a. Distractor-aware Siamese networks for visual object tracking. In: European Conference on Computer Vision. pp. 101–117.
Zhu, Z., Wu, W., Zou, W., Yan, J., 2018b. End-to-end flow correlation tracking with spatial-temporal attention. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 548–557.
Zhuang, B., Lu, H., Xiao, Z., Wang, D., 2014. Visual tracking via discriminative sparse similarity map. IEEE Trans. Image Process. 23, 1872–1881.
