Visual Object Tracking: A Survey
1. Introduction

An object tracking algorithm tracks an object's position in 2D or 3D input from devices such as wireless sensor networks (wireless signal), radar (radar echo), or cameras (video frames). Visual object tracking takes a 3D frame sequence as the input to track a target object. Given the initialization of a specific target, visual object tracking estimates the trajectory of the target in the frame sequence. There are two ways to optimize the trajectory of a target object during the tracking process: the first is offline tracking, and the second is online tracking. Offline tracking (Smeulders et al., 2014) allows for global optimization of the tracking trajectory and is mainly used for multiple object tracking tasks. It scans forward and backward through all the frames of a sequence during the tracking procedure. Online tracking, on the other hand, aims to estimate the states of the target in subsequent frames given the state of the first frame. Different from offline tracking, online tracking supports forward scanning only.

Visual object tracking can be applied to many applications, such as video surveillance, motion analysis, vehicle navigation, traffic monitoring, and automatic robot navigation. For example, in video surveillance, it is necessary to figure out the trajectory of a specific target. In automatic robot navigation, a mobile robot needs to track and recognize a moving object. The performance of trackers is susceptible to the circumstances around the target and the variability of the target itself, such as the appearance representation of the target, illumination changes, the moving speed of the target, circumstance changes, occlusion, stability of the camera, and the processing speed of the trackers.

We have witnessed the rapid development of visual object tracking in the past few years, from sparse representation based trackers (e.g., L1T, Mei and Ling, 2009), discriminative trackers (e.g., MIL, Babenko et al., 2011, KCF, Henriques et al., 2015), to deep Siamese network based trackers (e.g., SiamFC, Bertinetto et al., 2016c and SiamRPN, Li et al., 2018c). Deep learning (DL) is an emerging technique for computer vision, and deep neural networks (DNN) have shown great success in tasks such as image classification, object detection, image segmentation, motion recognition, and face recognition. Trackers based on deep learning have also achieved significant improvement compared to traditional methods, including trackers that combine hand-crafted features with deep features and end-to-end tracking frameworks based on deep neural networks.

There are several survey papers on visual object tracking. In 2006, Yilmaz et al. (2006) proposed a survey that describes a general tracking process and classifies tracking approaches based on the shape and appearance representations of the target, including point tracking, kernel tracking, and silhouette tracking. The shape representations include points, primitive geometric shapes, object silhouette and contour, articulated shape models, and skeletal models. Moreover, the appearance representations are probability densities of object appearance, templates, active appearance models, and multi-view appearance models, providing us
with a systematic analysis of the whole procedure of visual tracking. Smeulders et al. (2014) evaluated nineteen different trackers that emerged from 1999 to 2012 with multiple evaluation metrics and proposed a new dataset, ALOV++, that covers more challenging circumstances. Zhang et al. (2013) overviewed the trackers based on sparse coding and conducted an experimental comparison of the results of representative trackers. Li et al. (2018b) summarized the state-of-the-art trackers based on deep learning and gave a comprehensive experimental comparison on three tracking datasets, including OTB-100 (Wu et al., 2015), TC-128 (Liang et al., 2015), and VOT2015 (Kristan et al., 2015). Many challenging scenarios in visual object tracking still need to be studied and addressed, especially with the development of machine learning and deep learning, the practical application requirements, and new large-scale datasets. Fiaz et al. (2019) reviewed recently proposed trackers based on handcrafted features and deep learning; that survey discussed trackers from two main categories: Correlation Filter (CF) based trackers and non-CF based trackers. Marvasti-Zadeh et al. (2021) summarized trackers based on deep learning technologies and categorized the trackers from nine aspects, including network architecture, network exploitation, network training, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Compared to previous surveys, the main differences of this survey are: (1) the categorization method in our paper is different and finer for discriminative trackers than Fiaz et al. (2019); (2) we include more recent deep trackers and provide detailed experimental results on five tracking datasets, namely OTB-100, VOT, LaSOT, GOT-10k, and TrackingNet; (3) most traditional tracking methods are ignored in Marvasti-Zadeh et al. (2021), while we consider both traditional and more recent deep trackers and provide detailed experimental results for more recent state-of-the-art trackers; (4) the characteristics of each tracker with different attributes are summarized in Table 7.

In this survey, we mainly focus on online tracking algorithms proposed in the past ten years, but also consider some early representative works. To systematically analyze state-of-the-art trackers, we classify existing methods into generative trackers, discriminative trackers, collaborative trackers, and deep learning based trackers. Generative trackers establish the appearance model of the target in a continuous sequence to search for the region most similar to the target object. The appearance model plays a key role in the process of tracking, and many approaches have been developed for learning an appearance model, such as subspace learning, sparse representations, spatio-temporal motion energy, and Boolean map representations. Discriminative trackers formulate visual object tracking as a binary classification problem, in which a classifier is trained to distinguish the target from the background. The representative methods include MIL (Babenko et al., 2011), Struck (Hare et al., 2016), DLSSVM (dual linear structured SVM) (Ning et al., 2016), and correlation filter based trackers. Trackers equipped with a correlation filter significantly accelerate learning and detection with the convolution theorem and the Fast Fourier Transform (FFT) (Bolme et al., 2010). To adapt to the appearance variation of a target, these trackers allow the classifiers or correlation filters (CFs) they use to be trained online. Collaborative trackers take advantage of both generative and discriminative ways of tracking. Benefiting from the great success of deep learning technologies in computer vision, a tremendous number of deep trackers have been proposed, which can be classified into two categories: (1) the first category combines deep features with traditional tracking algorithms; (2) the second category is implemented with deep neural networks and trained in an end-to-end manner. Early trackers exploited deep features from a model pre-trained on other visual tasks (e.g., the imagenet-vgg-2048 network, Chatfield et al., 2014). Afterward, several works formulated trackers as a convolutional network for similar object detection or to regress to the target location directly, which can be trained end-to-end in an offline phase with large-scale datasets. Recently, advanced bounding box regression approaches and training methods further enhance the discriminative ability and accuracy of the tracker. The relationship between object detection and visual object tracking has been strengthened, and many state-of-the-art trackers in the literature benefit from object detection technologies, such as Region Proposal Networks and anchor-based or anchor-free detectors. In addition, newly proposed large scale datasets provide researchers powerful means to build and train deeper and more complex neural networks (e.g., SiamRPN++, Li et al., 2019b, SiamDW, Zhang and Peng, 2019) instead of shallow layers (e.g., FCNT, Wang et al., 2015, GOTURN, Held et al., 2016, and SINT, Tao et al., 2016) for learning the target model. Compared with traditional discriminative trackers, most deep trackers can achieve higher tracking speeds due to their end-to-end frameworks. However, one limitation of existing deep trackers is that they must carefully design and finetune multiple hyper-parameters when tracking on a new dataset.

The main contributions of this survey are listed as follows:
1. We give a comprehensive analysis of trackers that covers both the conventional trackers and the deep learning based trackers.
2. We experimentally evaluate the tracking results of state-of-the-art trackers on five important benchmark datasets.
3. We discuss the eleven challenging scenarios, hyperparameter tuning, and motion models that influence the tracking performance.
4. We give suggestions for future directions in this area based on the advanced technologies of computer vision and deep learning.

The rest of the paper is organized as follows. We discuss the generative trackers in Section 2 and review the discriminative trackers in Section 3. We discuss the collaborative trackers in Section 4 and review the trackers based on deep learning in Section 5. We present the evaluation methodologies for trackers in Section 6 and summarize the tracking datasets in Section 7. We present the tracking results in Section 8 and summarize challenging circumstances in visual object tracking in Section 9. We also discuss the model update, motion model, hyperparameter tuning, and tracking speed in Section 9. We discuss future directions and open issues in Section 10, and we conclude with remarks in Section 11.

2. Generative trackers

In this section, we give a comprehensive review of generative trackers. Given the initial information (e.g., a ground truth bounding box) of the target object in the first image, generative trackers mainly search for the most similar region in the test image. Generally, generative trackers can be categorized into template matching based trackers and particle filter based trackers (detailed in Sections 2.1 and 2.2, respectively). In particle filter based trackers, the target state can be determined by the maximum a posteriori estimation given the previous observation set of the target up to the current frame. This process mainly relies on a state transition model and an observation model (detailed in Section 2.2). In addition, multiple appearance models can be integrated into the particle filtering framework conveniently.

2.1. Generative trackers based on template matching

In template matching, an object is represented by target templates, and the tracking process is performed by matching the representation of the target in the current frame with previous templates via a similarity measure. For example, a fast normalized cross-correlation algorithm is applied in NCC (Briechle and Hanebeck, 2001) to perform template matching and reduce the computation load directly.
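For illustration, the following minimal sketch scores candidate windows of a search image against a fixed template with normalized cross-correlation. The array names and the brute-force sliding loop are ours for clarity and are not taken from the implementation of Briechle and Hanebeck (2001), which accelerates exactly this computation.

```python
import numpy as np

def ncc_score(patch, template):
    """Normalized cross-correlation between two equally sized gray patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum()) + 1e-12
    return float((p * t).sum() / denom)

def match_template(search, template):
    """Slide the template over the search image and return the best window."""
    H, W = search.shape
    h, w = template.shape
    best, best_pos = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            s = ncc_score(search[y:y + h, x:x + w], template)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos, best

# toy usage: plant the template inside a noisy search image
rng = np.random.default_rng(0)
template = rng.random((8, 8))
search = rng.random((32, 32))
search[10:18, 5:13] = template + 0.05 * rng.standard_normal((8, 8))
print(match_template(search, template))   # close to ((10, 5), ~1.0)
```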
In KLT (Baker and Matthews, 2004), the affine warp is utilized to find the matching between the template image T(x) and the input image I(x). In KAT (Nguyen and Smeulders, 2004), the appearance features are smoothed temporally by robust Kalman filters to update the template for handling occlusions. The target matching is then performed by optimizing the robust error function (Hager and Belhumeur, 1998; Baker and Matthews, 2004) between two image patches, which makes the error metric more robust to outliers.

The constraint of trackers based on template matching is that they mainly exploit the spatial information through the template but offer less adaptivity when the target undergoes non-rigid transformations. In LOT (Avidan et al., 2015), the Locally Orderless Matching used in the Earth Mover's Distance (EMD) optimization problem is more robust to occlusion and non-rigid transformations compared to IVT (Ross et al., 2008), MIL (Babenko et al., 2011), and TLD (Kalal et al., 2011). Yu et al. (2016) integrated multiple features into a unified similarity measure, and the relationship among samples in the forthcoming frame is utilized to enhance this similarity measure. By extending the Best-Buddies Similarity (BBS) proposed in Dekel et al. (2015), Liu et al. (2018) applied the Mutual Buddies Similarity (MBS) to find the optimal candidate region that suits the template, which is updated via a memory filtering strategy. The difference between MBS and BBS is that the former utilizes the similarity metric MBP between two patches, which is computed among multiple reciprocal nearest neighbors, while the similarity metric BBP used in BBS is the special case in which the number of nearest neighbors of a patch is 1.

2.2. Generative trackers based on particle filter

The particle filter or Sequential Monte Carlo (SMC) (Isard and Blake, 1998) model, which approximates the posterior probability of state variables with a finite set of particles sampled from the state space, provides a convenient framework for visual object tracking (Arulampalam et al., 2002; Isard and Blake, 1998) due to its generality, flexibility, and simple implementation. In addition, the particle filters used for propagating sample distributions over time are effective in handling non-Gaussianity and multi-modality problems. The particle filter consists of two stages, namely prediction and update. Given all available observations z_{1:t-1} = {z_1, z_2, ..., z_{t-1}}, the prediction stage predicts the posterior probability of x_t with the state transition model p(x_t | x_{t-1}) as follows:

p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1}    (1)

At time t, the observation z_t is available, and the state vector is updated using the Bayesian rule as:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t) \, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}    (2)

where p(z_t | x_t) is the observation likelihood and p(z_t | z_{1:t-1}) denotes the normalizing constant. The required posterior p(x_t | z_{1:t}) in Eq. (2) is approximated by a set of samples ('particles') {x_t^i}_{i=1,...,N} with associated weights ω_t^i. The candidate samples x_t^i are drawn from an importance distribution q(x_t | x_{1:t-1}, z_{1:t}), where a number of sampling algorithms can be used, such as importance sampling (Isard and Blake, 1998), Sequential Importance Sampling (SIS) (Doucet et al., 2001), the Rao-Blackwellized Particle Filter (Khan et al., 2004), and ISPF (Li et al., 2012).
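As a concrete illustration of the prediction and update steps in Eqs. (1) and (2), the sketch below runs one cycle of a bootstrap particle filter with a Gaussian random-walk transition model; the observation likelihood is passed in as a placeholder function that a real tracker would derive from its appearance model.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, trans_std=4.0, rng=None):
    """One prediction/update cycle of a bootstrap particle filter.

    particles : (N, d) array of target states (e.g., x, y, scale)
    weights   : (N,) importance weights from the previous frame
    likelihood: callable mapping an (N, d) state array to (N,) observation
                likelihoods p(z_t | x_t) under the appearance model
    """
    rng = rng if rng is not None else np.random.default_rng()
    N = len(particles)

    # resample proportionally to the previous weights (avoids degeneracy)
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    particles = particles[idx]

    # prediction: propagate each particle with the transition model p(x_t | x_{t-1})
    particles = particles + rng.normal(0.0, trans_std, size=particles.shape)

    # update: re-weight by the observation likelihood and normalize
    weights = likelihood(particles)
    weights = weights / (weights.sum() + 1e-12)

    # MAP-style estimate: the most likely particle (the weighted mean also works)
    estimate = particles[np.argmax(weights)]
    return particles, weights, estimate

# toy usage with a synthetic likelihood peaked at (50, 60)
rng = np.random.default_rng(0)
parts = rng.normal([48.0, 58.0], 5.0, size=(300, 2))
w = np.full(300, 1.0 / 300)
fake_lik = lambda s: np.exp(-0.5 * np.sum((s - [50.0, 60.0]) ** 2, axis=1) / 9.0)
parts, w, est = particle_filter_step(parts, w, fake_lik, rng=rng)
print(est)   # close to [50, 60]
```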
In visual object tracking, particle filters are widely used to estimate the target states from a time series state space model, which improves the robustness of trackers and enables them to deal with appearance changes effectively. Different kinds of appearance models and features can be incorporated into such a framework, such as subspace learning (e.g., IVT, Ross et al., 2008; Sui et al., 2018), sparse representation (e.g., L1T, Mei and Ling, 2009 and BPR-L1, Mei et al., 2011), and motion energy combined with color histograms in Zhou et al. (2014). We summarize the classical works as follows.

2.2.1. Subspace learning

In visual object tracking, the subspace representation of the target object, usually an eigenbasis computed from the singular value decomposition, provides a compact notion of the 'thing' being tracked rather than treating the target as a set of independent pixels in an image. To adapt to appearance changes, the tracker needs to retrain the eigenbasis with a set of additional new images. However, the update efficiency significantly influences the tracking performance. In IVT (Ross et al., 2008), an incremental learning algorithm was proposed to update the mean and eigenbasis accurately and efficiently in the low-dimensional subspace. Such a learning strategy can incrementally incorporate new data with less space and time complexity and adapt online to appearance changes of the target. In Fig. 1, we give a detailed illustration of incremental learning and tracking based on the particle filtering framework. PMT (Zhang et al., 2014a) employed a part based appearance model to represent the target and part matching methods to obtain the occlusion information of individual parts. In TAG (Kwon et al., 2009), the 2D affine motion was reformulated as a particle filtering problem on the 2D affine group Aff(2).

2.2.2. Sparse representation

There is extensive literature on sparse representation based trackers, in which the target can be represented as a sparse linear combination of target templates. To handle the occlusion problem, Mei and Ling (2009) first proposed a sparse representation composed of a set of target and trivial templates for visual object tracking. The sparse coefficients can be obtained by solving an ℓ1-regularized optimization problem. Meanwhile, the dictionary templates are updated dynamically according to appearance changes to preserve robustness. Afterward, many related works have been done to improve the performance of sparse representation models, for example, the sparsity-inducing ℓ_{p,q} mixed norms (p ≥ 1, q ≥ 1) based multi-task learning in MTT (Zhang et al., 2012a), multi-view (e.g., color, shape, and texture) based multi-task learning that exploits both the relationship between particles and views in MTMVT (Hong et al., 2013), subspace learning with PCA and sparse representation for the appearance model in Wang et al. (2013), a discriminative sparse similarity (DSS) map that combines coefficients of positive and negative templates constructed in Zhuang et al. (2014), nonlocal regularized multi-view sparse representation in NR-MVDLSR (Kang et al., 2019), the joint structural sparse appearance model in SST (Zhang et al., 2015), and metric-weighted linear representation in Li et al. (2016b).

CLREST (Zhang et al., 2014b) learns a robust linear representation by solving a low-rank, consistent, and sparse learning problem, which corresponds to the nuclear norm, the ℓ_{2,1} norm, and the ℓ1 norm, respectively. More specifically, the low-rank representation can be implemented with the nuclear norm ‖Z‖_* to force a joint representation of the particles of the target rather than an independent one, where Z is the representation of the corresponding observation X of the particles. Afterwards, the representations of the current frame and the previous frame are compared using the ℓ_{2,1} norm, denoted as ‖Z − Z_0‖_{2,1}, which encourages the temporal consistency of the representations of tracking results, where Z and Z_0 are the representations of the current and previous frames, respectively. An example of this tracker is given in Fig. 2. Due to the limitation that ℓ_p norm regularized least squares optimization does not consider the correlation between feature dimensions, which is important for object/non-object classification with complicated appearance variations, Li et al. (2016b) introduced the metric-weighted linear representation learned with a discriminative Mahalanobis metric matrix. In addition, Chen et al. (2017b) proposed a dynamically modulated mask sparse tracking method, in which the mask templates produced by frame difference can model the corruption of the target more precisely than the trivial templates utilized in early sparse representation based trackers.

2.2.3. Multi-task learning

Particle filters for tracking improve the robustness of a tracker by sampling a sufficient number of samples, while a dense sampling strategy generally leads to a high computational load. Besides the sparse representation, multi-task learning has been used in MTT (Zhang et al., 2012a), MTMVT (Hong et al., 2013), and MCPF (Zhang et al., 2017c). Fig. 3 shows an example of the structure of the learned coefficient matrix.
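To make the sparse-representation appearance model of Section 2.2.2 concrete, the sketch below codes a candidate patch over target plus trivial templates with a few ISTA iterations and uses the reconstruction error over the target templates as a quality score. The dictionary sizes, the plain ISTA solver, and the fixed step size are illustrative choices rather than the exact L1T solver.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_code(y, D, lam=0.05, n_iter=200):
    """Solve min_c 0.5*||y - D c||^2 + lam*||c||_1 with plain ISTA."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = soft_threshold(c - step * D.T @ (D @ c - y), step * lam)
    return c

# toy usage: 10 target templates plus positive/negative trivial templates
rng = np.random.default_rng(0)
d, k = 64, 10                                  # patch dimension, number of target templates
T = rng.random((d, k))                         # target templates (columns)
D = np.hstack([T, np.eye(d), -np.eye(d)])      # [targets | trivial templates]
y = T @ np.array([0.7, 0.3] + [0.0] * (k - 2)) + 0.01 * rng.standard_normal(d)
c = l1_code(y, D)
recon_err = np.linalg.norm(y - T @ c[:k]) ** 2  # small error => good candidate
print(recon_err, np.count_nonzero(np.abs(c) > 1e-3))
```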
Fig. 1. Illustration of the subspace learning example in IVT (Ross et al., 2008), where MAP denotes the Maximum a Posteriori estimation used for searching the most likely particle. U', σ', and Ī_c are the updated basis vectors, singular values, and mean vector, respectively. During the tracking process in frame t, a set of particles (candidate states) is drawn from the particle filter with the dynamical model, which is implemented as a Gaussian model. For each particle i, we estimate its observation likelihood p(Z_t^i | X_t^i). Then, the target state can be determined by computing the MAP among all particles. Finally, the tracking result can be used to update the appearance model.
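As a simplified illustration of the subspace appearance model in Fig. 1, the sketch below fits a mean and a low-dimensional eigenbasis to a window of recent target patches and scores a candidate by its reconstruction error. IVT updates the same kind of basis incrementally instead of refitting it from scratch, so this is a conceptual sketch rather than the incremental algorithm itself.

```python
import numpy as np

def fit_subspace(patches, k=8):
    """Fit a mean and k-dimensional eigenbasis to row-vectorized target patches."""
    mean = patches.mean(axis=0)
    # right singular vectors of the centered data = principal directions
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:k]                      # (d,), (k, d)

def reconstruction_error(candidate, mean, basis):
    """Distance of a candidate patch to the learned subspace (lower = more target-like)."""
    centered = candidate - mean
    coeff = basis @ centered
    return float(np.sum((centered - basis.T @ coeff) ** 2))

# toy usage: candidates close to the training patches score lower
rng = np.random.default_rng(0)
train = rng.random((20, 256)) * 0.1 + np.linspace(0, 1, 256)   # correlated patches
mean, basis = fit_subspace(train, k=5)
good = train[0] + 0.01 * rng.standard_normal(256)
bad = rng.random(256)
print(reconstruction_error(good, mean, basis) < reconstruction_error(bad, mean, basis))  # True
```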
Fig. 2. Illustration of low-rank and sparse learning in CLREST (Zhang et al., 2014b). z⃗_i^O and z⃗_i^B represent the target object and background templates, respectively. We can see that the columns of the coefficient matrix Z are sparse and few dictionary templates are used for representing the particles (e.g., particles x⃗_i, x⃗_j, and x⃗_k).

Fig. 3. Illustration of the multi-view and multi-task learning example (Hong et al., 2013). The view1, view2, and view3 in dictionary M represent different features. The row sparse matrix P and column sparse matrix Q construct the overall coefficient matrix with respect to dictionary M.
Fig. 4. Illustration of the comparison between Struck (Hare et al., 2016) and MIL (Babenko et al., 2011). MIL learns the classifier with positive and negative bags, while Struck avoids the labeling procedure and trains the classifier directly on the tracking output (bounding box).

2.2.5. Other appearance models

Inspired by the primary visual cortex (area V1), Zhang et al. (2017a) proposed a hierarchical five-layer appearance model and performed the visual tracking within the particle filter framework. Zhang et al. (2018d) represented the target object with Boolean maps that are generated by thresholding the HOG and color feature maps, where different granularities of Boolean maps capture multi-scale connectivity cues of the target. In CNT (Zhang et al., 2016), a two-layer convolutional network is utilized to extract object features without offline training. However, CNT easily fails to localize the target object when its appearance changes significantly due to motion blur or out-of-view.

The appearance model of generative trackers plays an important role, and great efforts have been put into improving the tracking performance. The success of deep learning in recent years, especially its powerful representative ability, helps us to build a more robust and accurate appearance model. We will discuss the deep trackers in Section 5.

3. Discriminative trackers

For each frame of the sequence, the purpose of discriminative trackers (Wang et al., 2009; Henriques et al., 2012; Kalal et al., 2011; Hua et al., 2015; Babenko et al., 2011; Nam and Han, 2016; Song et al., 2018) is to learn a discriminative classifier that separates the target from the background. We summarize different types of discriminative trackers according to their ways of model learning, as shown in Fig. 5. In this section, we mainly discuss the conventional discriminative trackers that are not constructed based on deep neural networks.

3.1. Trackers based on classifiers

A popular kind of tracking approach is called tracking-by-detection, in which a discriminative classifier is trained online to separate the target object from the background. For example, FBT (Nguyen and Smeulders, 2006) trained the foreground/background discrimination dynamically with texture features, MIL (Babenko et al., 2011) used Haar-like features (Viola and Jones, 2001) and the Online-MILBoost (Zhang et al., 2006) algorithm to train a boosting classifier, and SST (Song et al., 2016) trained the classifier with self-similarity learned features and a linear SVM. In addition, object detection has been significantly improved with the development of machine learning techniques, which further improves the performance of tracking methods.

3.2. Trackers based on correlation filters

Recently, Correlation Filter (CF) based trackers have drawn much attention due to their high computational efficiency and adaptivity. The developments of DCF for tracking can be categorized into two trends: one is the application of improved features, such as the deep features used in CCOT (Danelljan et al., 2016), ECO (Danelljan et al., 2017a), and DeepSTRCF (Li et al., 2018a), and the depth information used in Chen et al. (2017a); the other is the theoretical innovation in the learning of filters, such as MOSSE (Bolme et al., 2010), CCOT (Danelljan et al., 2016), CSR-DCF (Lukežič et al., 2017), ECO (Danelljan et al., 2017a), DeepSTRCF (Li et al., 2018a), and MCPF (Zhang et al., 2017c), including different regularization methods and optimization algorithms.

3.2.1. Single-channel filters

In this section, we introduce the early DCF tracker that learned a single-channel filter with single-channel features for tracking (Bolme et al., 2010). In detail, the goal is to learn the correlation filter w from an image patch x of M × N pixels as in MOSSE (Bolme et al., 2010). All the circular shifts x_{m,n}, (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}, are generated as training samples with Gaussian function labels y(m, n). Then, we can find the minimizer w by solving the following ridge regression problem:

\min_{\mathbf{w}} \sum_{(m,n)} \left\| \mathbf{w}^{T}\mathbf{x}_{m,n} - y(m,n) \right\|^{2} + \lambda \|\mathbf{w}\|^{2}    (4)

where λ is the regularization parameter. The circular convolution in objective function (4) equals the correlation operation in the spatial domain:

\min_{\mathbf{w}} \| \mathbf{w} \star \mathbf{x} - \mathbf{y} \|^{2} + \lambda \|\mathbf{w}\|^{2}    (5)

where ⋆ denotes the circular correlation. The correlation operation in the spatial domain equals the element-wise multiplication in the Fourier domain according to Parseval's theorem. Therefore, the objective function (5) can be expressed as:

\min_{\hat{\mathbf{w}}} \| \hat{\mathbf{w}} \circ \hat{\mathbf{x}}^{*} - \hat{\mathbf{y}} \|^{2} + \lambda \|\hat{\mathbf{w}}\|^{2}    (6)

where the hat denotes the Discrete Fourier Transform (DFT) of a signal, such as \hat{\mathbf{x}} = \sqrt{T}\,F\,\mathbf{x}, and the constant matrix F is the DFT matrix. The ∘ denotes the Hadamard product in the frequency domain, and * denotes the complex conjugate. The filter \hat{\mathbf{w}} in Eq. (6) can be computed efficiently as \hat{\mathbf{w}} = \hat{\mathbf{x}}^{*} \circ \hat{\mathbf{y}} \, / \, (\hat{\mathbf{x}}^{*} \circ \hat{\mathbf{x}} + \lambda).
Fig. 5. A summary of traditional discriminative trackers. This graph summarizes different kinds of traditional discriminative trackers, including classifiers, DCF based trackers, and
Spectral filter based trackers. Multi-channel filter based trackers with regularization technologies can be categorized into four kinds according to the type of regularization method.
3.2.2. Multi-channel filters

To improve the tracking accuracy and robustness, there are works on multi-channel correlation filter based trackers. Different from the single-channel filters that are learned with a single-channel feature (e.g., a gray image), the multi-channel filters are learned with multi-channel features (e.g., HOG features, RGB images, Haar-like features). MCCF (Kiani Galoogahi et al., 2013) extended the single-channel filters of MOSSE (Bolme et al., 2010) to multi-channel filters, so as to be applied to multi-channel images (e.g., RGB images) or features (e.g., HOG). Staple (Bertinetto et al., 2016b) applied HOG and color features jointly to learn a model that is inherently robust to both shape deformation and color changes.

Apart from using different features, the kernel trick (Henriques et al., 2015), multi-feature and multi-kernel learning (Tang and Feng, 2015; Choi et al., 2016), scale adaptive kernels (Li and Zhu, 2014), long-term tracking (Ma et al., 2015c), and multiple types of correlation filters (Ma et al., 2018) were proposed to further improve the capability of DCF. Henriques et al. (2015) employed kernelized correlation filters (KCF) for real-time tracking. Tang and Feng (2015) extended KCF to multi-kernel correlation filters (MKCF) with multiple features, in which the power-law (Dollár et al., 2014) is used to rescale the feature instead of rescaling the image patch to speed up the determination of the target scale during tracking. Inspired by ensemble trackers (Avidan, 2007; Wang and Yeung, 2014), Zhang and Suganthan (2017) proposed a co-trained KCF (COKCF) tracker which consists of two KCF based sub-trackers (each with different kernel functions) to improve robustness to complex backgrounds and significant appearance changes, where each KCF is associated with a different kind of feature (either a handcrafted feature or a deep feature). ACFT (Ma et al., 2018) made use of gradient-based features (HOG), intensity-based features (HOI), and deep features to learn three kinds of correlation filters, including a translation filter, a scale filter, and a long-term filter.

Besides applying multi-channel feature maps to learn correlation filters, there are various techniques that can be incorporated into the learning process, such as learning filters in a continuous spatial domain, employing different regularizations, exploiting context information, and combining part-based models. We summarize these trackers as follows.

• Regularization Methods

In order to alleviate the boundary effects caused by the periodic assumption on samples, SRDCF (Danelljan et al., 2015b) and its variant DeepSRDCF (Danelljan et al., 2015a) added a spatial weight function on the filters to penalize the magnitude of the filter coefficients depending on the spatial locations. In DeepSTRCF (Li et al., 2018a), the authors introduced a temporal regularization term to SRDCF (Danelljan et al., 2015b) with a single sample, in which ADMM (Boyd et al., 2011) is used to obtain a globally optimal solution instead of large linear equations and a Gauss–Seidel solver. CSCT (Fan et al., 2018) combined a dual-color clustering model and a novel spatio-temporal regularization (Li et al., 2018a) to improve the robustness and discriminative ability of the tracker. ASRCF (Dai et al., 2019) adaptively learned the spatial weight online based on BACF (Galoogahi et al., 2017b), which guides the tracker to learn more reliable filters. (See Fig. 6.)

Conventional DCF trackers train the filters with examples generated in a small search region that contains very limited context information. To obtain more real negative examples, CACF (Mueller et al., 2017) and BACF (Galoogahi et al., 2017b) both considered the context information for learning the filter. These two approaches apply different context patch production strategies: CACF prepares the context patches before training, while BACF integrates the cropping operator into the objective function. However, the context patches produced by the hand-crafted context selection strategy in CACF are more limited than those in BACF.

In PAC (Zhang et al., 2018a), the Boolean maps (Zhang et al., 2018d) and DCF with distractor-resilient metric regularization were combined to play the role of spatial selective attention and appearance
Fig. 6. Visualization of the spatial regularization weights (a) in SRDCF (Danelljan et al., 2015b) and the adaptive spatial regularization weights (b, c, d) in ASRCF (Dai et al., 2019), and the corresponding image regions used for training. The spatial region corresponding to the background features is assigned a large penalty in w and vice versa. The spatial regularization in SRDCF assigns similar values to locations with equal distance to the sample center, and the values are fixed during the tracking process, while ASRCF learns to change the spatial regularization weights according to different target objects.
selective attention, in which the spatial selective attention describes the target and its scene, and the appearance selective attention enhances the discriminative capability of the filters.

• Continuous Filters

The filter learning in conventional DCF is based on the assumption that all feature channels must have the same spatial resolution. Most trackers utilize a resampling strategy to make sure that all feature channels have the same resolution, but such a manner could inevitably introduce artifacts and a feature alignment problem. In order to efficiently fuse multi-resolution feature maps, Danelljan et al. proposed a novel approach called CCOT (Danelljan et al., 2016). It extended conventional DCF to learn the correlation filters in a continuous spatial domain, which enables the integration of multi-resolution features in an implicit way, instead of explicitly resampling the different feature channels to the same resolution.

The goal is to learn the convolution operator S_f which corresponds to a set of convolutional filters f = (f^1, ..., f^D) ∈ L^2(T), where f^d ∈ L^2(T) is the continuous filter of channel d in a continuous interval [0, T). For the d-th feature channel, the interpolation operator J_d : R^{N_d} → L^2(T) is defined as:

J_d\{x^d\}(t) = \sum_{n=0}^{N_d - 1} x^d[n] \, b_d\!\left(t - \frac{T}{N_d} n\right)    (7)

where b_d is the interpolation function and N_d is the dimension of the d-th feature channel of the j-th sample x_j^d ∈ R^{N_d}. Thus, we define the convolution operation as:

S_f\{x\} = \sum_{d=1}^{D} f^d * J_d\{x^d\}, \quad x \in \mathcal{X}    (8)

where * denotes convolution in the continuous domain. Given a set of training pairs \{x_j, y_j\}_{j=1}^{m}, the continuous filters can be learned by solving a regularized least-squares objective. Then, the Fourier coefficients of the new convolution operator become \widehat{S_{Pf}\{x\}} = \sum_{d=1}^{D} P \hat{f}^{d} \hat{X}^{d} \hat{b}_{d}, and we can reformulate Eq. (9) in the Fourier domain as:

E(f, P) = \left\| \hat{z}^{T} P \hat{f} - \hat{y} \right\|_{\ell^{2}}^{2} + \sum_{d=1}^{D} \left\| \hat{\omega} * \hat{f}^{d} \right\|_{\ell^{2}}^{2} + \lambda \|P\|_{F}^{2}    (11)

where \hat{z}^{d}[k] = \hat{X}^{d}[k]\,\hat{b}_{d}[k], and λ is the weight parameter of the newly added regularization term.

• Part-based Methods

In order to improve the robustness of trackers against occlusion, a number of part-based DCF trackers have been proposed recently. Liu et al. (2015) divided the target region into several parts, in which KCF was applied on each part of the object. DRT (Sun et al., 2018a) developed the DCF by jointly exploiting the discrimination and reliability information, where the element-wise product of a base filter and a reliability term is used to construct the correlation filter. By considering the influence of occlusion, Wang et al. (2018a) proposed an occlusion-aware part-based tracking framework that combines the global model and the part-based model. To help further understand the core idea of the part-based model, we show two different dividing strategies in Fig. 8.

3.3. Spectral filters

Similar to DCF trackers that employ various features to learn correlation filters in the Fourier domain, Cui et al. (2019) represented the pixel-wise spatial structure of the image region as an undirected weighted graph and learned spectral filters on it. Based on this graph, they proposed a Spectral Filter Tracking (SFT) framework and defined a graph Laplacian operator ℒ = D − W, where D is the
Fig. 8. Examples of part-based models for tracking, where (a) and (b) correspond to Wang et al. (2018a) and Sun et al. (2018a), respectively. Both of them divide the image into multiple parts with different strategies. We can see that the tracking process of DRT in (b) produces a meaningful reliability map that indicates the weights of different parts. The reliability map can be used to generate more discriminative filters for tracking.
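As a minimal sketch of the role of the reliability map in (b), the snippet below fuses per-part response maps with reliability weights before picking the target position; the shapes, the weights, and the simple weighted sum are illustrative and do not reproduce the joint optimization used in DRT.

```python
import numpy as np

def fuse_part_responses(responses, reliabilities):
    """Combine per-part correlation responses into one map with reliability weights.

    responses     : (P, H, W) response maps, one per object part
    reliabilities : (P,) non-negative reliability weights (higher = more trusted part)
    """
    w = np.asarray(reliabilities, dtype=float)
    w = w / (w.sum() + 1e-12)
    fused = np.tensordot(w, responses, axes=1)          # (H, W) weighted sum
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, peak

# toy usage: the occluded part (low reliability) barely influences the fused peak
rng = np.random.default_rng(0)
resp = 0.1 * rng.random((3, 50, 50))
resp[0, 20, 30] += 1.0   # reliable part votes for (20, 30)
resp[1, 21, 31] += 1.0   # reliable part votes nearby
resp[2, 5, 5] += 1.0     # occluded/unreliable part votes elsewhere
fused, peak = fuse_part_responses(resp, reliabilities=[1.0, 1.0, 0.1])
print(peak)   # near (20, 30)
```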
diagonal degree matrix, and D_{ii} = \sum_{j} W_{ij}. The normalized graph Laplacian can be computed as:

\mathcal{L}_{norm} = D^{-\frac{1}{2}} \mathcal{L} D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}    (12)

Given the input feature x, the frequency filtering on ℒ can be defined as \hat{z}(\lambda_l) = \hat{x}(\lambda_l)\,\hat{g}(\lambda_l), where λ_l is the l-th spectrum of the graph ℒ, and \hat{g}(\lambda_l) is the spectral filter that needs to be learned. This procedure is the same as the way of computing the response in DCF. The SFT operates on the local regions around a pixel (i.e., a vertex), which can improve the robustness to local variations and occlusion. Please note that, in this method, the filter function \hat{g}(\cdot) is approximated by Chebyshev polynomials as \hat{g}(\lambda_l) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\lambda}_l), where the eigenvalues {λ_l} are scaled and shifted as \tilde{\lambda}_l = (2/\lambda_{max})\lambda_l - 1 to make them fall in [−1, 1]. (See Fig. 9.)

To summarize, DCF trackers are computationally efficient and are easy to integrate with other models. However, DCF trackers have the following limitations: (i) they tend to be affected by boundary effects, background clutter, and occlusion; (ii) they have constrained capacity for target scale estimation under severe non-rigid deformation; (iii) their tracking speed degrades when combined with high-dimensional deep features.

4. Collaborative trackers

From the previous sections, most trackers belong to either generative models or discriminative models. Yu et al. (2008), Zhong et al. (2012), Ma et al. (2015b), and Zhao et al. (2016) explore fusion strategies to combine the generative model and the discriminative model into a stronger tracker.

Here, ω_i denotes the weight of the i-th candidate state, and S(θ) = 1/\sum_i ω_i is utilized to normalize the sum of the weighted entropy.

MUSTer (Hong et al., 2015) applies long-term and short-term memory schemes for tracking, where the Integrated Correlation Filters (ICF) are used for short-term tracking, and local features (e.g., SIFT, Lowe, 2004) are used for keypoint matching in long-term tracking.

Zhao et al. (2016) combined discriminative global and generative local appearance models to construct a more robust and discriminative tracker. The global representation was obtained by extracting features from the color and texture of the target. The generative local model exploited the scale invariant feature transform and spatial geometric information. During tracking, the global model and local model were integrated into the Bayesian approach to estimate the posterior probability, as in Eq. (2) in Section 2.2.

In DLRT (Sui et al., 2018), both generative and discriminative information are employed for target localization. The linear classifier learned in the subspace is used to distinguish the target from the neighboring background patches. The observation model in DLRT (Sui et al., 2018) for target localization is formulated as:

p(\mathbf{c} \mid s_t) \propto \exp\left\{ -\frac{1}{l}\left( \left\|1 - g(\mathbf{c}; \mathbf{P}, \mathbf{w}, b)\right\| + \rho\,\delta(\mathbf{c}; \mathbf{P}) \right) \right\}    (14)

which denotes the likelihood of a candidate c to be the target given the state s_t, where g(c; ·) denotes the normalized reliability with the linear classifier defined by (w, b), and δ(c; ·) denotes the reconstruction error with the learned subspace P in the generative view.

5. Trackers based on deep learning

With the growing popularity of using deep learning in a wide range of computer vision tasks, some researchers focus their attention on applying deep neural networks to visual object tracking.
Fig. 11. The flowchart of Siamese network based learning for tracking, where ⋆ denotes the correlation operation and BBR represents the Bounding Box Regression. In this kind
of tracker, the shared backbone network (e.g., VggNet, ResNet, and GoogLeNet) is used to extract feature representation for both the template and search branches. Then, the
response is calculated by the correlation operation between the features of the two branches. Most recent deep Siamese network based trackers (e.g., SiamRPN++, SiamBAN) employ
additional classification and BBR sub-networks to improve the robustness and accuracy of the tracking result. The final tracking result is determined by selecting the bounding
box with the highest classification score. Other ways of information interaction can also be applied such as global matching between graphs in Guo et al. (2021), cross-attention
in Chen et al. (2021) and Wang et al. (2021b).
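The correlation operation ⋆ in the figure amounts to sliding the template feature map over the search feature map and summing over channels. The sketch below does this directly with NumPy on hypothetical feature maps; Siamese trackers such as SiamFC and SiamRPN++ perform the same operation on deep features inside the network, usually as a fast batched convolution rather than explicit loops.

```python
import numpy as np

def xcorr_response(search_feat, template_feat):
    """Dense cross-correlation of a template feature map over a search feature map.

    search_feat   : (C, Hs, Ws) feature map of the search region
    template_feat : (C, Ht, Wt) feature map of the exemplar/template
    returns       : (Hs - Ht + 1, Ws - Wt + 1) response map
    """
    C, Hs, Ws = search_feat.shape
    _, Ht, Wt = template_feat.shape
    out = np.zeros((Hs - Ht + 1, Ws - Wt + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = search_feat[:, y:y + Ht, x:x + Wt]
            out[y, x] = np.sum(window * template_feat)   # sum over channels and space
    return out

# toy usage: the peak of the response marks the template location in the search region
rng = np.random.default_rng(0)
template = rng.standard_normal((16, 6, 6))
search = 0.1 * rng.standard_normal((16, 22, 22))
search[:, 9:15, 4:10] += template                 # embed the target at offset (9, 4)
resp = xcorr_response(search, template)
print(np.unravel_index(np.argmax(resp), resp.shape))   # (9, 4)
```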
Fig. 13. The flowchart of ensemble learning for tracking (Qi et al., 2016). The feature maps from different convolution layers construct the weak trackers. During the tracking process, the responses of the weak trackers are weighted and summed to predict the target position, where the weight of each weak tracker is updated by a hedging algorithm.

Fig. 15. An example of the conv-LSTM module for learning the target filter in Yang and Chan (2017). The conv-LSTM takes the exemplar feature e_t as input and produces the target object filter f_t for tracking in the next frame. Both the initial hidden state h_0 and cell state c_0 are produced by feeding the initial exemplar feature map e_0 of the first frame into a convolutional layer.
Fig. 18. Examples of non-local attention modules, where (a) is the non-local attention used in HASiam (Yang et al., 2019b), and (b) and (c) refer to Transformer modules for fusing the template and search patches in Chen et al. (2021) and Wang et al. (2021b), respectively. The difference between (b) and (c) is that the latter encodes a history of template features and their Gaussian-shaped masks in the Decoder module, where the masks are used for Mask Transformation (top-right in (c)) and Feature Transformation (top-left in (c)).
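For reference, the sketch below implements plain scaled dot-product cross-attention with NumPy, with template tokens acting as keys and values and search-region tokens as queries. The Transformer modules in (b) and (c) wrap this operation with learned projections, multiple heads, positional encodings, and feed-forward layers, so this is only the core mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: search tokens (queries) attend to template tokens.

    queries : (Nq, d) search-region feature tokens
    keys    : (Nk, d) template feature tokens
    values  : (Nk, d) template feature tokens (often identical to keys)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)       # (Nq, Nk) similarity matrix
    weights = softmax(scores, axis=-1)           # attention weights over template tokens
    return weights @ values                      # (Nq, d) template-conditioned features

# toy usage: 256 search tokens attend to 64 template tokens of dimension 32
rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32))
k = rng.standard_normal((64, 32))
fused = cross_attention(q, k, k)
print(fused.shape)   # (256, 32)
```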
5.2.1. Using the pre-trained CNN model

A common way in early deep trackers is to directly use the deep features extracted from a CNN model pre-trained on other tasks (e.g., image classification). For example, DeepSTRCF (Li et al., 2018a), DeepSRDCF (Danelljan et al., 2015a), and CCOT (Danelljan et al., 2016) directly use multiple deep features extracted from a pre-trained model such as the VGG-m network (Chatfield et al., 2014) and combine them with hand-crafted features to represent the target, which can significantly improve the tracking performance.

5.2.2. Offline learning methods

To improve the discriminative ability of the tracker, most backbone networks in deep trackers are trained in the offline phase, such as in SiamFC (Bertinetto et al., 2016c) and MDNet (Nam and Han, 2016). SiamFC (Bertinetto et al., 2016c) trained its network on the ILSVRC (Russakovsky et al., 2015) dataset. In addition to employing large scale datasets, such as Youtube-BB (Real et al., 2017) and ILSVRC in SiamRPN (Li et al., 2018c), data augmentation techniques (e.g., the flip, rotation, shift, stretching, blur, and dropout in Bhat et al., 2018; the color distortion by changing image brightness, contrast, saturation, and hue in Yang and Chan, 2017) that expand the training datasets also improve the network's discriminative ability and prevent it from over-fitting. In GOTURN (Held et al., 2016), cropped image pairs from still images are added to augment the training set and are useful to enhance the discriminative ability of the network. In DaSiamRPN (Zhu et al., 2018a), the diverse categories of positive pairs collected from more complex datasets (e.g., ImageNet Detection, Russakovsky et al., 2015 and COCO Detection, Lin et al., 2014) and semantic negative pairs extracted from both the same categories and different categories are utilized to improve the tracker's discriminative ability.

The average center location error (CLE) is the pixel-wise average Euclidean distance between the center locations of the target and the manually labeled ground truth bounding boxes. Given the target's ground truth location s_i and the estimated location \hat{s}_i, the average location error can be computed by:

CLE = \frac{1}{N}\sum_{i=1}^{N} \| s_i - \hat{s}_i \|    (18)

where N is the length of a sequence and ‖·‖ is the Euclidean distance. Similar to the pixel-wise CLE metric, the deviation in Smeulders et al. (2014) computes the average center location error between the predicted center and the ground truth center by:

Deviation = 1 - \frac{\sum_{i \in S} \delta(x_T^i, x_G^i)}{|S|}, \qquad \delta(x_T^i, x_G^i) = \frac{\| x_T^i - x_G^i \|}{size(b_G^t)}    (19)

where x_T^i and x_G^i are the centers of the i-th predicted and ground truth bounding boxes in a set of frames, and b_G^t denotes the ground truth bounding box of the target in the t-th frame. The function δ(x_T^i, x_G^i) is the Euclidean distance between the two centroids, normalized by the size of the ground truth bounding box b_G^t. |S| denotes the number of successfully tracked bounding boxes. The drawback of the above two metrics is that such estimation may fail to reflect the performance of the tracking algorithms (Babenko et al., 2011) correctly. Instead, the precision plot (Babenko et al., 2011; Wu et al., 2013) shows the percentage of frames in which the estimated target locations are within a threshold distance of the ground truth, and can be defined as:

precision = \frac{N_{\tau}}{N_{frames}}    (20)

where N_τ denotes the number of successfully tracked frames whose center location error is within a threshold τ (e.g., 20 pixels), and N_{frames} denotes the total number of frames in a sequence.
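The two center-based metrics above reduce to a few lines of NumPy once the per-frame centers are available; the sketch below assumes centers are given as (x, y) pairs.

```python
import numpy as np

def center_location_error(pred_centers, gt_centers):
    """Average Euclidean distance between predicted and ground-truth centers, Eq. (18)."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float(d.mean())

def precision_at(pred_centers, gt_centers, tau=20.0):
    """Fraction of frames whose center error is within tau pixels, Eq. (20)."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float((d <= tau).mean())

# toy usage on a 3-frame sequence
gt = [(100, 100), (110, 105), (120, 110)]
pred = [(103, 101), (140, 130), (121, 111)]
print(center_location_error(pred, gt), precision_at(pred, gt, tau=20))
```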
The precision metric only measures the localization accuracy and does not reflect the scale of the target. Furthermore, the success plot represents the percentage of frames for which the overlap between the estimated bounding box and the ground truth exceeds a threshold. Let b_t and b_g denote the bounding boxes of the tracked target and the ground truth, respectively; we define the intersection-over-union (IoU) between b_t and b_g as:

IoU = \frac{b_t \cap b_g}{b_t \cup b_g}    (21)

where ∩ is the intersection of the two bounding boxes and ∪ denotes the union of the two bounding boxes. Then the average overlap measure based on IoU can be defined as the mean of the IoU over the whole sequence. Therefore, given a threshold A, the success of a sequence can be computed as:

R_A = \frac{N_A}{N_G} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_A(s_i)    (22)

where s_i is the IoU between the tracking result of the i-th frame and the ground truth bounding box, and \mathbf{1}_A(s_i) is an indicator function that takes the value 1 if s_i exceeds the threshold A and 0 otherwise.

6.3. Robustness

One pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE) are three evaluation metrics proposed in Wu et al. (2015). OPE evaluates a tracker throughout a test sequence given the initialization (i.e., the bounding box of the target object) of the first frame, where different initialization states would affect the tracking performance. Besides, TRE and SRE were proposed (Wu et al., 2013) as alternatives to analyze the robustness of a tracker under different initialization states, including spatial circumstances (i.e., starting with different bounding boxes) and temporal circumstances (i.e., starting at different frames). Furthermore, SRE with restart (SRER) evaluates whether a tracker is sensitive to spatial turbulence, and OPE with restart (OPER) measures the performance of the tracker with restarting when a tracking failure occurs. In the VOT tracking benchmark, the robustness (R) measures how many times the tracker loses the target (fails) during tracking, where a failure is detected when the overlap measure becomes zero. In VOT2015 (Kristan et al., 2015), both the average overlap and robustness (failures) are considered to obtain the Expected Average Overlap (EAO).

6.4. Tracking length

Tracking length (Čehovin et al., 2016) is another metric that computes the number of successfully tracked frames between the initialized frame and the first failure, which can be detected by a center or overlap measure given a predefined threshold γ. This criterion may suffer from some difficult scenarios, including tracking failure due to sudden target shift, or poor initialization due to low resolution, out-of-view, or heavy occlusion that happens close to the initialization frame. Once a tracking failure is detected, the tracker will stop working, and the rest of the sequence will be discarded. So, the failure rate (Čehovin et al., 2016), which measures the mean number of failures per sequence and was also introduced in VOT2016 (Kristan et al., 2016) as a performance evaluation metric, can reflect the robustness of the tracker over the entire sequence.

The Longest Subsequence Measure (LSM) proposed in the TLP (Liu et al., 2015) dataset aims to evaluate long-term tracking performance. In detail, LSM computes the percentage of frames of the longest successfully tracked continuous subsequence among the whole sequence. For a tracking dataset with N sequences in total, the LSM at the threshold x% can be computed as:

R_{LSM}(x) = \frac{1}{N}\sum_{i=1}^{N} r_i(x)    (23)

where r_i(x) is an indicator function that takes the value 1 if the length of the longest successfully tracked subsequence is no less than x% of the sequence and 0 otherwise. The parameter x varies from 0 to 1. R_{LSM}(x) denotes the ratio of successfully tracked sequences to the whole dataset, where a sequence is successfully tracked if x% of its frames have IoU > 0.5. This metric reflects the ability of a tracker to continuously track the target in a sequence without failure.

6.6. F-score plot

To take both the precision and recall measures into account, Lukežič et al. (2021) proposed the F-score defined on precision and recall to evaluate long-term tracking performance. In this metric, precision indicates the accuracy of target absence prediction and recall reflects the target re-detection capability of the tracker. Given a video sequence and its ground truth bounding boxes of the target, the F-score can be computed as:

F(\tau_{\theta}) = \frac{2\,Pr(\tau_{\theta})\,Re(\tau_{\theta})}{Pr(\tau_{\theta}) + Re(\tau_{\theta})}    (24)

where Pr(τ_θ) and Re(τ_θ) are the precision and recall at the classification threshold τ_θ, respectively, each of which is defined as:

Pr(\tau_{\theta}) = \frac{1}{N_p}\sum_{t \in \{t : A_t(\tau_{\theta}) \neq \emptyset\}} \Omega(A_t(\tau_{\theta}), G_t), \qquad Re(\tau_{\theta}) = \frac{1}{N_g}\sum_{t \in \{t : G_t \neq \emptyset\}} \Omega(A_t(\tau_{\theta}), G_t)    (25)

where A_t(τ_θ) denotes the predicted state at time-step t with τ_θ being the classification threshold, and θ_t is the predicted classification score. G_t is the corresponding ground truth target state. Ω(A_t(τ_θ), G_t) computes the intersection over union between the predicted state A_t(τ_θ) and the ground truth G_t. In long-term tracking, the ground truth is empty if the target is absent, i.e., G_t = ∅. Similarly, the prediction is set as empty if the classification score is below the classification threshold τ_θ, i.e., θ_t < τ_θ. N_p denotes the number of frames where A_t(τ_θ) ≠ ∅, and N_g is the number of frames with G_t ≠ ∅.

6.7. Tracking speed

Tracking speed is an important criterion in tracking, especially in practical applications, and is expressed as the mean number of frames that can be processed by a tracker per second. Nevertheless, unlike accuracy and robustness, which can be evaluated and compared fairly across different trackers within the same experiment and dataset, the tracking speed is determined not only by the computation of the tracking algorithm itself but also by other factors, such as implementation platforms (e.g., Windows, Linux, and Mac), programming languages (e.g., C, C++, Python, and MATLAB), the hardware platforms used to perform the experiments, and the deep learning frameworks such as Caffe, Torch, TensorFlow, Julia, MXNet, and MatConvNet.
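Putting the overlap-based definitions together, the sketch below computes the per-frame IoU of axis-aligned (x, y, w, h) boxes and the area under the success curve obtained by sweeping the threshold A; it mirrors Eqs. (21) and (22) rather than the exact behavior of any particular benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes, Eq. (21)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success rate R_A for a sweep of overlap thresholds A, Eq. (22)."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(overlaps > a).mean() for a in thresholds])

def success_auc(pred_boxes, gt_boxes):
    """Area under the success curve, the usual ranking score on OTB."""
    return float(success_curve(pred_boxes, gt_boxes).mean())

# toy usage on a 2-frame sequence
gt = [(10, 10, 40, 40), (12, 12, 40, 40)]
pred = [(12, 11, 40, 40), (30, 30, 40, 40)]
print(success_curve(pred, gt)[::5], success_auc(pred, gt))
```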
8.1. Quantitative evaluation on OTB datasets
Table 2
Comparison of most popular tracking benchmarks in the literature.
Dataset Sequences Frames Mean frames Classes Resolution Visual attributes Shot/Long-term
OTB-100 (Wu et al., 2015) 100 59k 598 16 – 11 S
VOT2015 (Kristan et al., 2015) 60 21455 357 20 – 11 S
VOT2016 (Kristan et al., 2016) 60 21455 357 20 – 5 S
VOT2018 (Kristan et al., 2018) 60 21356 356 24 – 5 S
TLP (Moudgil and Gandhi, 2017) 50 676k 13k 17 1280 × 720 6 L
UAV123 (Mueller et al., 2016) 123 113k 915 9 – 12 S
ALOV300++ (Smeulders et al., 2014) 315 8936 483 – – 14 S
TColor-128 (Liang et al., 2015) 129 55k 431 27 – 11 S
OxUvA (Gavves et al., 2018) 366 1.55m 4.2k 22 – 6 L
LTB35 (Lukežič et al., 2018) 35 146k 4k 19 1280 × 720 ∼ 290 × 217 10 L
GOT-10k (Huang et al., 2018) 10k 1.5m 149 563 – 6 S
LaSOT (Fan et al., 2019) 1400 3.52m 2506 70 1280 × 720 14 L
TrackingNet (Mueller et al., 2018) 30k 14m 471 27 – 15 S
NfS (Galoogahi et al., 2017a) 100 380k 3.8k 17 – 9 L
NUS-PRO (Li et al., 2016a) 365 109k 370 8 1280 × 720 12 S
Fig. 21. The overall performance of precision and success plots on the OTB-2013 (Wu et al., 2013), OTB-50 (Wu et al., 2015), and OTB-100 (Wu et al., 2015) datasets using
one-pass evaluation (OPE).
the supplementary material of Wu et al. (2015) proved that AOS equals AUC with enough uniformly sampled thresholds.

As shown in Fig. 21, the deep trackers and the DCF trackers with deep features achieve high performance in terms of accuracy and robustness. The Siamese tracker STMTrack (Fu et al., 2021) shows considerable superiority over others in terms of robustness with the highest AUC score of 71.9%. The ECO (Danelljan et al., 2017a) tracker, which employs the feature maps extracted from the VGG-m network (Chatfield et al., 2014), outperforms ECO-HC (Danelljan et al., 2017a), which uses hand-crafted features (e.g., HOG and Color Names), by a large margin. The adaptive spatially regularized DCF tracker ASRCF (Dai et al., 2019) still outperforms the top Siamese trackers and achieves the highest precision of 92.2%. Both ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016) can achieve competitive precision. We attribute this to the more accurate and stable filters that are learned in the DCF trackers than those in the deep Siamese trackers. In Fig. 22, the robustness of state-of-the-art deep Siamese trackers (e.g., SiamBAN and SiamR-CNN) surpasses others in a number of challenging scenarios. However, the DCF trackers can still achieve comparable precision in situations like occlusion, out-of-view, and motion blur, as shown in Fig. 23.

Although VITAL (Song et al., 2018) achieves a high precision of 91.7% on OTB-100, ECO (Danelljan et al., 2017a) achieves much better robustness. In Fig. 22, ECO (Danelljan et al., 2017a) exhibits its robustness in most of the challenging scenarios, whereas it is more sensitive to low resolution than SA-SIAM (He et al., 2018) and VITAL (Song et al., 2018). Because the high-level deep features are robust to the appearance variety of objects, SiamR-CNN (Voigtlaender et al., 2020) achieves the best robustness.

The overall performance on the OTB-100 dataset also indicates that the dataset has become highly saturated over recent years. Both the AUC and precision show a relatively small gap among the top trackers such as STMTrack (Fu et al., 2021), SiamGAT (Guo et al., 2021), SiamR-CNN (Voigtlaender et al., 2020), and TrDiMP (Danelljan et al., 2020).
Table 3
Attributes of the benchmarks.
Dataset IV SV OCC DEF MB FM IPR OPR OV BC LR FOC VC CM POC ARC/LSV SOB/CON FBC ROT CS FL DL SHC MC S/LO SC SP TR MS MCO LC ZC LV LI
OTB-100 (Wu et al., 2015) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
VOT2015 (Kristan et al., 2015) ✓ ✓ ✓ ✓ ✓
VOT2016 (Kristan et al., 2016) ✓ ✓ ✓ ✓ ✓
VOT2018 (Kristan et al., 2018) ✓ ✓ ✓ ✓ ✓
TLP (Moudgil and Gandhi, 2017) ✓ ✓ ✓ ✓ ✓ ✓ ✓
Fig. 22. The success plots of each attribute on OTB-100 (Wu et al., 2015), including background clutter, deformation, illumination variation, in-plane rotation, low resolution, out
of view, motion blur, occlusion, out-of-plane rotation, scale variation, and fast motion.
8.2. Quantitative evaluation on VOT datasets

In this section, we evaluate state-of-the-art trackers on three VOT datasets using the vot-toolkit (Kristan et al., 2017). The results for three metrics, including Expected Average Overlap (EAO), Accuracy Value (Av), and Robustness Value (Rv), are shown in Table 4. Each of the VOT datasets contains 60 sequences. VOT2015 and VOT2016 contain the same sequences; the main difference is that the ground truth bounding boxes in VOT2016 are more accurate than those in VOT2015. In VOT2018, the least challenging sequences from VOT2016 are replaced by new sequences. EAO was introduced in VOT2015 to unify the accuracy and robustness of the trackers. The VOT datasets contain more sequences with non-rigid object deformation. In Table 4, the deep trackers like SiamAttn (Yu et al., 2020) and OceanOn (Zhang et al., 2020) achieve high performance compared to DCF trackers. The high performance benefits from deeper backbone networks and discriminative detection modules, for example, the multi-level RPN modules in SiamAttn (Yu et al., 2020), the bounding box adaptive network in SiamBAN (Chen et al., 2020), and the segmentation prediction network in D3S (Lukežič et al., 2020). Ocean significantly outperforms SiamRPN++ with a large gain of 5.2% in EAO because the object-aware features, combined with the regular-region features from convolution, improve the reliability of the classification network.
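The VOT measures reported in Table 4 can be approximated by the simplified sketch below (our own illustration; the official vot-toolkit additionally applies burn-in frames after each re-initialization and aggregates EAO over sequence-length intervals). Accuracy is the mean overlap over the frames where the tracker did not fail, and robustness is derived from the number of failures.

    import numpy as np

    def vot_accuracy_robustness(tracked_ious, n_failures):
        # Accuracy (Av): mean overlap over the frames tracked without failure.
        accuracy = float(np.mean(tracked_ious))
        # Robustness (Rv) is based on how often the tracker fails and is re-initialized;
        # the per-sequence failure counts are then averaged over the dataset.
        return accuracy, n_failures

    # toy usage: overlaps of the successfully tracked frames and two failures
    ious = np.clip(np.random.default_rng(1).normal(0.6, 0.1, size=580), 0.0, 1.0)
    print(vot_accuracy_robustness(ious, n_failures=2))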
Fig. 23. The precision plots of each attribute on OTB-100 (Wu et al., 2015), including background clutter, deformation, illumination variation, in-plane rotation, low resolution,
out of view, motion blur, occlusion, out-of-plane rotation, scale variation, and fast motion.
The conventional trackers still show a large gap to the recent deep trackers, especially on videos with severe appearance deformation. Among the DCF trackers, DeepSTRCF (Li et al., 2018a), CSR-DCF (Lukežič et al., 2017), and DeepSRDCF (Danelljan et al., 2015a) mainly focus on resolving the boundary effects with different regularization methods. DeepSTRCF (Li et al., 2018a) surpasses its counterparts DeepSRDCF (Danelljan et al., 2015a) and CSR-DCF (Lukežič et al., 2017) by 3.2% and 3.1% in terms of accuracy on VOT2018. In terms of robustness, DeepSTRCF reduces the failure rate by 14.1% and 49.2% compared to CSR-DCF (Lukežič et al., 2017) and DeepSRDCF (Danelljan et al., 2015a) on VOT2018, respectively. This illustrates that both the spatial and temporal regularizations are beneficial in improving the accuracy and robustness of trackers.

Among deep trackers, the accuracy of SiamRPN++ (Li et al., 2019b) outperforms SiamRPN (Li et al., 2018c) and DaSiamRPN (Zhu et al., 2018a) by 11% and 3% on VOT2018, respectively, and outperforms other trackers by a large margin. Among DCF trackers, DRT (Sun et al., 2018a), which combines multiple deep features (i.e., conv1 from VGG-m and conv4-3 from VGG-16, Chatfield et al., 2014) and hand-crafted features (i.e., HOG and Color Names) to represent the target object and introduces the base filter and reliability learning mechanism, achieves the best EAO and robustness scores on VOT2018. The robustness scores illustrate the effectiveness of the base filter introduced in DRT (Sun et al., 2018a) in learning a discriminative and reliable representation of the target.

However, the high dimensional deep features used in ECO (Danelljan et al., 2017a), CFCF (Gundogdu and Alatan, 2018), and DRT (Sun et al., 2018a) lead to inferior tracking speed compared with SiamRPN (Li et al., 2018c) and DaSiamRPN (Zhu et al., 2018a), as reported in Table 7. The diverse categories of positive and semantic negative pairs in DaSiamRPN (Zhu et al., 2018a) enhance the tracker's discriminative ability
Table 4
Performance comparison among the state-of-the-art trackers on VOT 2015, VOT 2016, and VOT 2018. The results are presented in terms of EAO, Av and Rv.
The top three results are marked in red, blue, and green fonts, respectively.
VOT2015 VOT2016 VOT2018
Trackers Av Rv EAO Av Rv EAO Av Rv EAO
SiamAttn (Yu et al., 2020) – – – 0.680 0.140 0.537 0.630 0.160 0.470
STMTrack (Fu et al., 2021) – – – 0.629 0.149 0.468 0.590 0.159 0.447
SiamRN (Cheng et al., 2021) – – – – – – 0.595 0.131 0.470
TrDiMP (Wang et al., 2021b) 0.666 0.121 0.490 0.617 0.131 0.480 0.600 0.141 0.462
OceanOn (Zhang et al., 2020) – – – – – – 0.592 0.117 0.489
Ocean (Zhang et al., 2020) – – – – – – 0.598 0.169 0.467
SiamRPN++ (Li et al., 2019b) 0.654 0.201 0.464 0.637 0.178 0.478 0.600 0.234 0.415
SiamR-CNN (Voigtlaender et al., 2020) 0.676 0.201 0.451 0.645 0.173 0.461 0.612 0.220 0.405
SiamBAN (Chen et al., 2020) 0.655 0.136 0.495 0.632 0.150 0.505 0.590 0.178 0.447
PrDiMP50 (Danelljan et al., 2020) 0.680 0.140 0.489 0.652 0.140 0.476 0.618 0.165 0.442
D3S (Lukežič et al., 2020) 0.605 0.150 0.421 0.611 0.131 0.458 0.591 0.159 0.459
SiamFC++ (Xu et al., 2020) – – – – – – 0.587 0.183 0.426
SPM-Tracker (Wang et al., 2019a) – – – 0.62 0.21 0.434 0.58 0.3 0.338
ATOM (Danelljan et al., 2019) 0.641 0.185 0.434 0.617 0.190 0.424 0.590 0.201 0.401
DiMP50 (Bhat et al., 2019) 0.643 0.159 0.452 0.624 0.136 0.479 0.597 0.152 0.44
ASRCF (Dai et al., 2019) 0.581 0.286 0.318 0.568 0.187 0.390 0.492 0.234 0.328
DeepSTRCF (Li et al., 2018a) 0.591 0.295 0.321 0.569 0.248 0.341 0.523 0.215 0.345
SA-SIAM (He et al., 2018) 0.594 0.342 0.313 0.543 0.337 0.291 0.5 0.459 0.236
DRT (Sun et al., 2018a) 0.556 0.169 0.389 0.557 0.173 0.390 0.519 0.201 0.356
CSR-DCF (Lukežič et al., 2017) 0.563 0.262 0.320 0.524 0.239 0.338 0.491 0.356 0.256
CFCF (Gundogdu and Alatan, 2018) 0.578 0.239 0.336 0.560 0.169 0.384 0.511 0.286 0.283
CCOT (Danelljan et al., 2016) 0.544 0.243 0.303 0.541 0.239 0.331 0.494 0.318 0.267
SiamRPN (Li et al., 2018c) 0.604 0.262 0.358 0.578 0.314 0.340 0.490 0.464 0.244
DaSiamRPN (Zhu et al., 2018a) 0.639 0.183 0.446 0.609 0.225 0.401 0.570 0.337 0.326
ECO (Danelljan et al., 2017a) 0.570 0.310 0.314 0.555 0.201 0.374 0.484 0.276 0.281
ECO_HC (Danelljan et al., 2017a) 0.563 0.361 0.280 0.542 0.304 0.322 0.494 0.435 0.238
DSST (Danelljan et al., 2017b) 0.549 0.763 0.172 0.535 0.707 0.181 0.395 1.452 0.079
LSART (Sun et al., 2018b) 0.580 0.196 0.371 0.495 0.215 0.323 0.495 0.276 0.323
MEEM (Zhang et al., 2014c) 0.503 0.501 0.221 0.490 0.515 0.194 0.463 0.534 0.193
Staple (Bertinetto et al., 2016b) 0.580 0.375 0.300 0.547 0.379 0.295 0.530 0.688 0.169
SRDCF (Danelljan et al., 2015b) 0.564 0.332 0.288 0.536 0.421 0.247 0.490 0.974 0.119
DeepSRDCF (Danelljan et al., 2015a) 0.568 0.281 0.318 0.507 0.326 0.276 0.492 0.707 0.154
SiamFC (Bertinetto et al., 2016c) 0.530 0.880 0.290 0.532 0.461 0.88 0.503 0.585 0.188
SAMF (Li and Zhu, 2014) 0.532 0.585 0.202 0.507 0.590 0.186 0.484 1.302 0.093
KCF (Henriques et al., 2015) 0.486 0.670 0.167 0.491 0.571 0.192 0.447 0.773 0.135
DSiam (Guo et al., 2017) 0.541 0.280 – – – – 0.512 0.646 0.196
IVT (Ross et al., 2008) 0.444 1.152 0.122 0.420 1.114 0.115 0.400 1.638 0.075
MIL (Babenko et al., 2011) 0.423 0.735 0.171 0.408 0.726 0.115 0.165 1.011 0.118
Fig. 26. The success plots on LaSOT (Fan et al., 2019) dataset for each attribute, including aspect ratio change, background clutter, camera motion, deformation, fast motion, full
occlusion, illumination variation, low resolution, motion blur, out-of-view, partial occlusion, rotation, scale variation, and viewpoint change.
cross-correlation layer and estimates the scale in a discrete scale space. However, SiamRPN (Li et al., 2018c) uses a Region Proposal Network (RPN) which combines a classification branch for target localization and a bounding box regression branch for refining the proposals. These two separate branches lead to a significant improvement over SiamFCv2 (Valmadre et al., 2017), with a gain of 8.9% in the AO metric. SiamBAN (Chen et al., 2020), Ocean (Zhang et al., 2020), and SiamFC++-GoogLeNet (Xu et al., 2020) employ anchor-free bounding box regressors to adapt to scale and aspect ratio changes. In addition, Ocean (Zhang et al., 2020) proved that the features transformed from the object region can contribute to the discriminative ability of the classification branch.
Fig. 27. The precision plots on LaSOT (Fan et al., 2019) dataset for each attribute, including aspect ratio change, background clutter, camera motion, deformation, fast motion,
full occlusion, illumination variation, low resolution, motion blur, out-of-view, partial occlusion, rotation, scale variation, and viewpoint change.
Inspired by advanced object detectors, centerness (Xu et al., 2020) and IoU (Jiang et al., 2018) are two ways to evaluate the quality of the proposals and can be combined with the classification branch for selecting the best proposal. The centerness feature map can be used to reweight the quality of the proposals and be combined with the classification branch in SiamRPN (Li et al., 2018c). SiamGAT (Guo et al., 2021) and SiamCAR (Guo et al., 2020) apply identical head networks that consist of a classification branch and a bounding box regression branch, except for the centerness branch used in SiamCAR. However, SiamGAT uses a target-aware template and computes the correlation scores between the target template and the search region within a complete bipartite graph, improving the accuracy of target localization when the shape and pose of the target change drastically. Specifically, it projects the bounding box onto the template feature map to select a region of interest that is more accurate than the fixed region in SiamRPN.
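As a concrete illustration of the centerness quality measure mentioned above, the sketch below follows the FCOS-style definition commonly adopted by anchor-free trackers such as SiamFC++ and SiamCAR; the exact head designs of the surveyed trackers differ, so this is only an indicative example.

    import numpy as np

    def centerness(l, t, r, b):
        # l, t, r, b: distances from a location inside the box to its left, top,
        # right, and bottom edges. Locations near the box center score close to 1,
        # locations near the border score close to 0.
        return np.sqrt((np.minimum(l, r) / np.maximum(l, r)) *
                       (np.minimum(t, b) / np.maximum(t, b)))

    # a location 10 px from the left/top edges and 30 px from the right/bottom edges
    print(centerness(10.0, 10.0, 30.0, 30.0))  # ~0.33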
Fig. 28. Qualitative results of the challenging sequences in LaSOT. From top to bottom, the sequences are airplane-15, bear-4, bicycle-18, bus-2, gametarget-2, giraffe-2, goldfish-3, motorcycle-1, kite-6, and surfboard-5.
8.5. Quantitative evaluation on TrackingNet dataset

TrackingNet (Mueller et al., 2018) contains more than 30k videos with more than 14 million annotated frames in total. The dataset has been split into a training set with 30,132 training videos from Youtube-BoundingBoxes (YT-BB) (Real et al., 2017) and a test set with 511 testing videos from YouTube with a Creative Commons license (YT-CC). The test set was selected with a distribution similar to the training set. The whole dataset contains 14,431,266 frames. The annotations for the test set are not visible to researchers except for the initial frame of each sequence, and all the trackers are evaluated through an online server in terms of success, precision, and normalized precision. The success and precision are computed as for the OTB-100 dataset. The normalized precision is obtained by normalizing the original precision over the size of the ground truth bounding box. In addition, more than 30% of the sequences in OTB-100 (Wu et al., 2015) are annotated with bounding boxes that have a constant aspect ratio, and such annotations in fact cannot reflect the exact target region, whereas the ground truth boxes of TrackingNet capture the target region as closely as possible.

We report the detailed comparison results of state-of-the-art trackers on TrackingNet in Table 6. As shown in Table 6, the TransT (Chen et al., 2021) tracker yields the highest success score of 82.06%, precision score of 80.64%, and normalized precision score of 87.09%. Compared with SiamR-CNN (Voigtlaender et al., 2020), which uses a re-detection scheme, TransT improves the performance by 0.86% in success, 0.64% in precision, and 1.69% in normalized precision, respectively. The recent deep trackers show stronger generalization ability to target variations than the previous trackers. Note that SiamRPN++ (Li et al., 2019b) outperforms D3S (Lukežič et al., 2020) by 0.5%. This improvement of SiamRPN++ is partially due to its pretraining on the training set of TrackingNet. Benefiting from the online target model adaption with the steepest descent (Wright et al., 1999) method and the discriminative learning loss, DiMP50 obtains a better AUC score of 74.0% compared to SiamRPN++. However, SiamRPN++ obtains a gain of 1% in AUC with multiple features of a deeper backbone network (i.e., ResNet50).
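The normalized precision mentioned above can be sketched as follows (our own illustration, based on our reading of the TrackingNet protocol): the center error is normalized by the width and height of the ground truth box before thresholding, so that the measure is less sensitive to the image resolution and target scale.

    import numpy as np

    def normalized_precision(pred_centers, gt_centers, gt_sizes, threshold=0.2):
        # Normalize the per-axis center error by the ground truth box width/height,
        # then report the fraction of frames whose normalized error is below the threshold.
        norm_err = np.linalg.norm((pred_centers - gt_centers) / gt_sizes, axis=1)
        return (norm_err <= threshold).mean()

    # toy usage: per-frame predicted/ground-truth centers (x, y) and box sizes (w, h)
    pred = np.array([[100.0, 100.0], [210.0, 160.0]])
    gt = np.array([[105.0, 98.0], [200.0, 150.0]])
    sizes = np.array([[50.0, 80.0], [40.0, 60.0]])
    print(normalized_precision(pred, gt, sizes))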
Table 6
Comparison results on TrackingNet dataset with performance measures of success,
precision, and normalized precision (NormPrec). The top three results are marked in
red, blue, and green fonts, respectively. SiamFC𝑜𝑟𝑖 and SiamFC𝑓𝑡 denote the original
model and the model fine-tuned on TrackingNet, respectively. LightTrack-LargeA and
LightTrack-LargeB are two searched trackers under different resource constraints. Please
refer to Yan et al. (2021) for the details.
Trackers Success ↑ Precision ↑ NormPre ↑
TransT (Chen et al., 2021) 82.1 80.7 87.1
SiamR-CNN (Voigtlaender et al., 2020) 81.2 80.0 85.4
STMTrack (Fu et al., 2021) 80.3 76.7 85.1
TrDiMP (Wang et al., 2021b) 78.4 73.1 83.3
TrSiam (Wang et al., 2021b) 78.1 72.7 82.9
SiamGAT (Guo et al., 2021) 75.3 69.7 80.7
LightTrack-LargeA (Yan et al., 2021) 73.6 70.0 78.8
LightTrack-LargeB (Yan et al., 2021) 73.3 70.8 78.9
SiamCAR (Guo et al., 2020) 74.0 68.4 80.4
FCOS-MAML (Wang et al., 2020) 67.7 62.5 75.2
Retain-MAML (Wang et al., 2020) 75.7 – 82.2
CGACD (Du et al., 2020) 71.1 69.3 80.0
SiamAttn (Yu et al., 2020) 75.2 – 81.7
D3S (Lukežič et al., 2020) 72.8 66.4 76.8
KYS (Bhat et al., 2020) 74.0 68.8 80.0
PrDiMP50 (Danelljan et al., 2020) 75.8 70.4 81.6
PrDiMP18 (Danelljan et al., 2020) 75.0 69.1 80.3
SiamFC++-GoogLeNet (Xu et al., 2020) 75.4 70.5 80.0
SiamFC++-AlexNet (Xu et al., 2020) 71.2 75.8 64.6
SiamBAN (Chen et al., 2020) 72.3 68.5 79.4
DCFST-50 (Zheng et al., 2020) 75.2 70.0 80.9
DiMP50 (Bhat et al., 2019) 74.0 68.7 80.1
DiMP18 (Bhat et al., 2019) 72.3 66.6 78.5
SiamRPN++ (Li et al., 2019b) 73.3 69.4 80.0
SPM-Tracker (Wang et al., 2019a) 71.2 66.1 77.8
ATOM (Danelljan et al., 2019) 70.3 64.8 77.1
C-RPN (Fan and Ling, 2019) 66.9 61.9 74.6
DaSiamRPN (Zhu et al., 2018a) 63.8 59.1 73.3
UPDT (Bhat et al., 2018) 61.1 55.7 70.2
MDNet (Nam and Han, 2016) 60.6 56.5 70.5
SiamFC𝑜𝑟𝑖 (Bertinetto et al., 2016c) 57.1 53.3 66.3
SiamFC𝑓𝑡 (Bertinetto et al., 2016c) 58.1 54.3 67.3
CFNet (Valmadre et al., 2017) 57.8 53.3 65.4
CSR-DCF (Lukežič et al., 2017) 53.4 48.0 62.2
Staple (Bertinetto et al., 2016b) 52.8 47.0 60.5
Staple𝐶𝐴 (Mueller et al., 2017) 52.9 46.8 60.5
ECO (Danelljan et al., 2017a) 55.4 49.2 61.8
ECO𝐻𝐶 (Danelljan et al., 2017a) 54.1 60.8 47.6
BACF (Galoogahi et al., 2017b) 52.3 46.1 58.0
SRDCF (Danelljan et al., 2015b) 52.1 45.5 57.3
KCF (Henriques et al., 2015) 44.7 41.9 54.6
DSST (Danelljan et al., 2014) 46.4 46.0 58.8
Struck (Hare et al., 2016) 45.6 40.2 53.9

Fig. 29. The overall performance of success rate plots on the GOT-10k dataset using one-pass evaluation (OPE).

Table 5
Comparison results on the GOT-10k dataset with performance measures of average overlap (AO) and success rate (SR). SR0.5 denotes the success rate over the frames where the overlaps exceed 0.5, and SR0.75 denotes the success rate over the frames where the overlaps exceed 0.75. The top three results are marked in red, blue, and green fonts, respectively.
Trackers AO ↑ SR0.5 ↑ SR0.75 ↑
TransT (Chen et al., 2021) 0.723 0.824 0.682
TrDiMP (Wang et al., 2021b) 0.671 0.777 0.583
TrSiam (Wang et al., 2021b) 0.660 0.766 0.571
SiamR-CNN (Voigtlaender et al., 2020) 0.649 0.728 0.597
STMTrack (Fu et al., 2021) 0.642 0.737 0.579
KYS (Bhat et al., 2020) 0.636 0.751 0.515
PrDiMP50 (Danelljan et al., 2020) 0.634 0.738 0.543
SiamGAT (Guo et al., 2021) 0.627 0.743 0.488
LightTrack-LargeB (Yan et al., 2021) 0.623 0.726 –
DiMP50 (Bhat et al., 2019) 0.611 0.717 0.492
OceanOn (Zhang et al., 2020) 0.611 0.721 0.473
SiamFC++-GoogLeNet (Xu et al., 2020) 0.607 0.737 0.464
D3S (Lukežič et al., 2020) 0.597 0.676 0.462
Ocean (Zhang et al., 2020) 0.592 0.695 0.465
SiamCAR (Guo et al., 2020) 0.579 0.677 0.436
SiamBAN (Chen et al., 2020) 0.549 0.651 0.404
ATOM (Danelljan et al., 2019) 0.556 0.634 0.402
SiamRPN++ (Li et al., 2019b) 0.510 0.606 0.316
SiamRPN (Li et al., 2018c) 0.463 0.549 0.253
SiamFCv2 (Valmadre et al., 2017) 0.374 0.404 0.144
SiamFC (Bertinetto et al., 2016c) 0.348 0.353 0.098
GOTURN (Held et al., 2016) 0.347 0.375 0.124
CCOT (Danelljan et al., 2016) 0.325 0.328 0.107
ECO (Danelljan et al., 2017a) 0.316 0.309 0.111
MDNet (Nam and Han, 2016) 0.299 0.303 0.099
CFNet_conv2 (Valmadre et al., 2017) 0.293 0.265 0.087
ECOhc (Danelljan et al., 2017a) 0.286 0.276 0.096
CFNet_conv5 (Valmadre et al., 2017) 0.270 0.225 0.072
CFNet_conv1 (Valmadre et al., 2017) 0.261 0.243 0.084
BACF (Galoogahi et al., 2017b) 0.260 0.262 0.101
DSST (Danelljan et al., 2014) 0.247 0.223 0.081
Staple (Bertinetto et al., 2016b) 0.246 0.239 0.089
SRDCF (Danelljan et al., 2015b) 0.236 0.227 0.094
fDSST (Danelljan et al., 2017b) 0.206 0.187 0.075
KCF (Henriques et al., 2015) 0.203 0.177 0.065

Moreover, we find that the gap between PrDiMP18 (Danelljan et al., 2020) and PrDiMP50 (Danelljan et al., 2020) is relatively small (e.g., 0.8% in AUC and 1.3% in precision) when the ResNet18 and ResNet50 backbone networks are applied, respectively. In addition, PrDiMP18 outperforms DiMP18 by 2.7% in success and 2.5% in precision, respectively, demonstrating that the probabilistic regression (trained by minimizing the KL-divergence) in Target Center Regression (TCR) and Bounding Box Regression (BBR) is the most crucial component that influences the capacity of the tracker. The DCF based tracker UPDT (Bhat et al., 2018) improves the baseline ECO by combining the output scores of deep (e.g., ResNet50) and shallow (e.g., HOG and Color Names) features with an adaptive fusion strategy. Although the KYS (Bhat et al., 2020) tracker shares the backbone network (i.e., ResNet50) and bounding box regressor (i.e., IoU-Net), we observe no significant improvement on the test set of TrackingNet compared to the baseline tracker DiMP50. The fine-tuned version SiamFC𝑓𝑡 improves its baseline SiamFC𝑜𝑟𝑖 by 1% in terms of both success and precision, which illustrates that fine-tuning some deep trackers on the training set of TrackingNet can improve their generality on the test set.

In the above subsections, we have presented the tracking results on five kinds of popular datasets. The discriminative deep trackers have shown superior performance compared to other DCF or generative trackers. Because of the deep backbone networks, discriminative classifiers, object detectors, and large-scale training datasets, we can develop a more powerful model during offline training and online tracking.
Fig. 30. Qualitative examples: tracking results of the challenging sequences in OTB2015 on 8 video sequences, including Bird1, Jogging-2, Car4, MotorRolling, Skating2-1, Skating1, Skiing, and Soccer. The ground truth bounding boxes are in white.
In this section, we first summarize the state-of-the-art trackers including their tracking methods, backbone networks, representation schemes, training datasets, classification/regression methods, training schemes, update schemes (no-update, linear-update, non-linear update), tracking speed, published year, implementation platform (GPU/CPU), re-detection (yes/no, i.e., Y/N), and report what kinds of tracking frameworks they utilized, including discriminative correlation filter (DCF), particle filter (PF), deep learning (DL), and tracking-by-detection (T&D), as shown in Table 7. The 'Representation' column summarizes the features that trackers used, including HOG, Color Names, PCA-HOG, CIE LAB color feature, HOI (histogram of local intensities), BMR (Boolean map representation), LBP (local binary pattern), intensity, edge, raw pixels, grayscale image, and different features from CNNs. The 'Backbone' column presents the neural networks used for feature representation and target modeling, such as AlexNet, CaffeNet, VGG-M, VGG-16, VGG-19, ResNet-18/50, and GoogLeNet. The 'Training Data' column summarizes the datasets used for training the deep trackers, including ILSVRC2015 VID, ILSVRC2015 DET, Youtube-BB, GOT-10k, LaSOT, COCO, TrackingNet, TC-128, Cityperson, WilderFace, VOC2012, OTB-100, VOT2013, VOT2014, VOT2015, and ALOV300++. The 'Training Scheme' column summarizes the objective functions and training methods of different trackers for further understanding of the tracking algorithms. For example, binary cross-entropy is used for training the classification branch, and IoU loss or 𝐿1 loss are used for training the bounding box regression branch. The 'Localization Method' column summarizes how the trackers perform target localization and scale estimation during tracking. The 'Update' column summarizes how the model is updated during tracking. The strategies for the model update can be classified into four categories: (1) no update: the model is not updated during tracking; (2) linear update: the model is linearly updated with the EMA (exponential moving average) strategy, such as in MOSSE (Bolme et al., 2010); (3) meta-update: the model is updated in the meta-learning framework during tracking; (4) nonlinear-update: the model is updated online as a nonlinear process.

We then discuss the ten most challenging scenarios, including occlusion, illumination, motion blur, deformation, scale variation, out-of-view, background clutter, fast motion, rotation, and low resolution, for the trackers mentioned in Table 7, according to their tracking performance. Finally, we discuss the model update, motion model, hyper-parameter tuning, and tracking speed.

9.1. Occlusion

In visual object tracking, developing a robust and accurate tracker is still challenging due to a vast number of occlusions caused by distractors or non-semantic backgrounds. Although the DCF based frameworks have achieved great success due to their high computational efficiency, especially when combined with deep learning methods, their capacity to deal with object occlusion still needs to be improved. In early works, sparse representations (Mei et al., 2011; Zhang et al., 2012b; Ji et al., 2012; Zhang et al., 2015; Sui et al., 2015) and part-based appearance models (Liu et al., 2015; Wang et al., 2018a) were proposed to deal with the occlusion problem. The sparse trackers model the appearance of a target with a linear combination of a template set, which can be dynamically updated according to the appearance changes (e.g., partial or heavy occlusion). However, most sparse trackers operate within the particle filter framework, where a sufficient number of particles need to be sampled to guarantee the good performance of the appearance model. Although we can improve the robustness of a tracker with more particles, the high computation cost will affect the tracking speed. Furthermore, the part-based trackers model the target by multiple parts to improve the robustness against partial occlusion. Recently, CPF (Zhang et al., 2018c) and MCPF (Zhang et al., 2017c) utilized multiple correlation filters (MCF) to adapt the appearance model to appearance variations. By applying MCF to the particle sampling process, the number of particles can be decreased significantly, and each of the particles can be shepherded towards a more accurate target location. However, as shown in Figs. 22 and 23, ECO (Danelljan et al., 2017a), CCOT (Danelljan et al., 2016), and CFCF (Gundogdu and Alatan, 2018) perform better than MCPF (Zhang et al., 2017c) for all attributes except low resolution and in-plane rotation in terms of precision.

A re-detection scheme was utilized in TRACA (Choi et al., 2018) to deal with the full occlusion scenario, which is determined by the difference between the maximal response values of adjacent frames. Moreover, other similarity comparison algorithms can also detect the full occlusion or the loss of the target. A phenomenon in the occlusion scenario is that state-of-the-art deep trackers like SiamBAN (Chen et al., 2020) and SiamRPN++ (Li et al., 2019b) lose the target due to severe occlusion in the 435th frame of sequence Skating2-1 in Fig. 30, while the DCF trackers like ECO (Danelljan et al., 2017a) and CFCF (Gundogdu and Alatan, 2018) still track the target successfully, which illustrates the weakness of some deep Siamese trackers (e.g., SiamRPN++) in dealing with occlusion.
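To make the 'Update' column of Table 7 concrete, the following minimal sketch (our own illustration, not taken from any specific tracker) shows the linear (EMA) update used by classic DCF trackers such as MOSSE: the newly estimated filter or template is blended with the previous model using a fixed learning rate.

    import numpy as np

    def linear_update(previous_model, new_estimate, learning_rate=0.02):
        # Exponential moving average: a small learning rate keeps the model stable,
        # while a large one adapts quickly but risks drifting onto occluders.
        return (1.0 - learning_rate) * previous_model + learning_rate * new_estimate

    # toy usage on a correlation-filter-like array
    rng = np.random.default_rng(0)
    model = np.zeros((64, 64))
    for _ in range(10):
        model = linear_update(model, rng.normal(size=(64, 64)))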
Table 7
Summary of state-of-the-art trackers with different attributes. C-Classification, R-Regression, SGD-Stochastic Gradient Descent, ADMM-Alternating Direction Method of Multipliers, BPTT-Backpropagation Through Time, APG-Accelerated
Proximal Gradient, MSE-Mean Square Error, CN-Color Names, BBR-Bounding Box Regression, CG-Conjugate Gradient.
Trackers Frameworks Backbone Representation Training data Localization method Training scheme Update FPS GPU Re- Year
detection
SiamRN (Cheng et al., DL ResNet-50 Block-3,4,5 ILSVRC2015 VID, binary classification, binary cross-entropy no-update – Y N 2021
2021) Youtube-BB, GOT-10k, BBR loss+IoU loss+MSE
LaSOT, ILSVRC2015 DET loss+SGD
SiamGAT (Guo et al., DL GoogLeNet Inceptionv3 ILSVRC2015 VID, binary classification, binary cross-entropy no-update 70 Y N 2021
2021) Youtube-BB, GOT-10k, BBR loss+IoU loss+SGD
COCO, ILSVRC2015 DET
STMTrack (Fu et al., 2021) DL GoogLeNet Inceptionv3 ILSVRC2015 VID, binary classification, focal loss+binary linear-update 37 Y N 2021
ILSVRC2015 DET, GOT-10k, BBR cross-entropy loss+IoU
COCO, LaSOT, TrackingNet loss+SGD
LightTrack (Yan et al., DL supernet – ILSVRC2015 VID, binary classification, binary cross-entropy no-update – Y N 2021
2021) ILSVRC2015 DET, COCO, BBR loss+IoU loss+SGD
GOT-10k, Youtube-BB
TransT (Chen et al., 2021) DL ResNet-50 Block-4 COCO, TrackingNet, LaSOT, binary classification, giou loss+𝐿1 no-update 50 Y N 2021
GOT-10k BBR loss+cross-entropy
loss+AdamW
SiamCAR (Guo et al., DL ResNet-50 Block-3,4,5 COCO, ILSVRC2015 VID, binary classification, binary cross-entropy no-update 52 Y N 2020
2020) ILSVRC2015 DET, BBR loss+IoU
Youtube-BB, GOT-10k, loss+centerness
LaSOT loss+SGD
TrDiMP (Wang et al., DL ResNet-50 – COCO, TrackingNet, LaSOT, binary classification, 𝐿2 norm loss+AdamW nonlinear-update 26 Y N 2021
2021b) GOT-10k IoU-prediction
KYS (Bhat et al., 2020) DL ResNet-50 Block-4 TrackingNet, LaSOT, binary classification, 𝐿2 norm nonlinear-update 20 Y N 2020
GOT10k IoU-prediction loss+MSE+binary
cross-entropy+Adam
Ocean (Zhang et al., 2020) DL ResNet-50 Block-4 Youtube-BB, ILSVRC2015 binary classification, IoU loss+binary no-update 58 Y N 2020
VID, ILSVRC2015 DET, BBR cross-entropy+SGD
GOT-10k, COCO
Retina/FCOS-MAML (Wang DL ResNet-18 Block-4 COCO, GOT-10k, binary classification, 𝐿2 norm loss(C)+𝐿1 meta-update 40 Y N 2020
et al., 2020) TrackingNet, LaSOT BBR norm loss(R)+SGD
SiamRPN++ (Li et al., DL ResNet-50 Block-3,4,5 COCO, ILSVRC2015 DET, binary classification, binary cross-entropy no-update 35 Y N 2019
2019b) ILSVRC2015 VID, BBR loss+𝐿1 loss+SGD
Youtube-BB
ASRCF (Dai et al., 2019) DCF VGG-M, Norm1(VGG-M), – binary classification, 𝐿2 norm loss+ADMM linear-update 28 Y N 2019
VGG-16 Conv4-3(VGG- scale searching
16),
HOG
SiamDW (Zhang and Peng, DL CIResNet-22 Block-3 ILSVRC2015 VID, binary classification, logistic/binary no-update 150 Y N 2019
2019) Youtube-BB BBR cross-entropy
loss+smooth 𝐿1
SACF (Zhang et al., 2018f) DL CaffeNet fc3, conv2 ILSVRC2015 VID binary classification, 𝐿2 norm+𝐿1 norm linear-update 23 Y N 2018
scale searching+spatial loss+
transformation
MAM (Chen et al., 2019) DL VGG-16 conv4-3, TC-128, OTB-100 binary classification, binary cross-entropy nonlinear-update 3 Y Y 2018
conv5–3 candidates sampling loss+ SGD
STNCF (Zhang et al., DCF – HOG – binary classification, 𝐿2 norm loss+ADMM linear-update 5 N N 2018
2018b) scale searching
SiamRPN (Li et al., 2018c) DL modified conv5 ILSVRC2015 VID, binary classification, smooth 𝐿1 loss+ no-update 200 Y N 2018
AlexNet Youtube-BB BBR binary cross-entropy
loss+SGD
SA-SIAM (He et al., 2018) DL AlexNet conv4, conv5 ILSVRC2015 VID binary classification, logistic loss+SGD no-update 50 Y N 2018
scale searching
CFCF (Gundogdu and DCF+DL VGG-M raw pixels, VOT2015, ILSVRC2015 VID binary classification, 𝐿2 norm loss+SGD linear-update 1.7 Y N 2018
Alatan, 2018) conv1-3, scale searching
conv5–2
TRACA (Choi et al., 2018) DCF+DL VGG-M conv3, conv4, VOC2012 binary classification, cross-entropy linear-update 101.3 Y Y 2018
conv5 scale searching loss+correlation filter
orthogonality loss+SGD
OAPT (Wang et al., 2018a) DCF+DL VGG-19 HOG, conv3-4, – binary classification, 𝐿2 norm loss linear-update 6 Y N 2018
conv4-4, scale searching
MCPF (Zhang et al., CPF VGG-19 conv3-4, – binary classification, least square loss+APG linear-update 1.96 Y N 2017
2017c) conv4-4, particle sampling
conv5–4
CSR-DCF (Lukežič et al., DCF – HOG, CN, HSV – binary classification, 𝐿2 norm loss+ADMM linear-update 13 N N 2017
2017) scale searching
BACF (Galoogahi et al., DCF – HOG – binary classification, ridge linear-update 35 N N 2017
2017b) scale searching regression+ADMM
DSiam (Guo et al., 2017) DL AlexNet conv4, conv5 ILSVRC2015 VID binary classification, logistic loss+BPTT+ linear-update 25 Y N 2017
scale searching SGD
SiamFC (Bertinetto et al., DL AlexNet conv5 ILSVRC2015 VID binary classification, logistic loss+SGD no-update 86 Y N 2016
A potential reason for such failure in these deep Siamese trackers is that the semantic deep features could easily lead to model drift when the target undergoes severe occlusion and the local feature matching is performed with convolution. However, some particle filter based trackers and part-based trackers could avoid the model drift problem to a certain extent due to the particle sampling strategy or detection with local features. From the success plot for the attribute of Partial Occlusion, the TrDiMP tracker outperforms its counterpart DiMP50 by 8% while not being equipped with the re-detection scheme of SiamR-CNN; the main reasons for this improvement are the robust embedded template feature from a set of history frames and the global cross-attention in the Transformer decoder.

9.2. Illumination

The illumination variations tend to happen with the changing of light on the background and the target objects, as in the examples (i.e., Car4 and Skating1) presented in Fig. 30. FlowTrack (Zhu et al., 2018b) addressed the illumination by a warping operation, which provides multi-dimensional information from previous frames to improve the discriminative ability. In the case of illumination changes, the motion information can be used to warp multiple features from the previous t-1 frames to the t-th frame, which provides diverse information for the tracker. Different from DeepSRDCF (Danelljan et al., 2015a), which employed a single-resolution deep feature, the tracking results of CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a) demonstrated that multi-resolution deep features can significantly improve the performance of DCF trackers.

Another solution to handle appearance variations such as illumination is to construct a mixture model to improve the robustness. Zhang et al. (2018c) proposed a correlation particle filter (CPF) model, which extends the KCF and exploits multiple correlation filters within the particle framework to handle the illumination problem. Furthermore, the multi-task correlation filter learned with different parts and features (Zhang et al., 2019b) of the target object can also be incorporated into the tracking framework to alleviate the illumination problem. This is because multiple correlation filters with respect to different kinds of features (e.g., intensity, Haar-like features, HOG, and deep features) can cover a wide range of appearance variations such as illumination and occlusion.

In addition to the spatial regularization in SRDCF (Danelljan et al., 2015b), DeepSTRCF (Li et al., 2018a) added a temporal regularization on the correlation filters, in which the filters are passively updated in the frames with small appearance variations. Similarly, STNCF (Zhang et al., 2018b) used a spatio-temporal nonlocal regularization that explores long-term spatio-temporal nonlocal information on DCF to improve the robustness to illumination variations. As shown in Fig. 22, for the illumination variation attribute, DeepSTRCF (Li et al., 2018a) achieves an AUC score of 66.3%, which outperforms DeepSRDCF (Danelljan et al., 2015a) by 4.2%. DeepSTRCF also outperforms deep trackers such as DaSiamRPN (Zhu et al., 2018a) and SA-SIAM (He et al., 2018) by 0.8% and 1.9%, respectively.

9.3. Motion blur

The motion blur often occurs when the target or camera moves fast, as in the sequences (i.e., MotorRolling and Soccer) shown in Fig. 30. In this scenario, it is difficult for a tracker to distinguish the contour or detect the local features (e.g., texture, color, and edge) of the target. The reasons for this include the low camera frame rate, which makes it challenging to capture fast-moving targets, and the camera shaking that may make the image blurry. To validate the effectiveness of tracking performance with different frame rates, Galoogahi et al. (2017a) introduced a tracking benchmark with a higher frame rate (i.e., 240 fps). The tracking performances at a high frame rate show that some trackers with hand-crafted features can achieve considerable improvement compared with those at a low frame rate, and even outperform some deep trackers.

As shown in Figs. 22 and 23, the conventional DCF trackers such as Staple (Bertinetto et al., 2016b), SAMF (Li and Zhu, 2014), DSST (Danelljan et al., 2014), and KCF (Henriques et al., 2015) cannot perform well with hand-crafted features in the case of motion blur. The periodic assumption utilized in these approaches during training and tracking may cause boundary effects, especially when the target has undergone severe motion blur. We see that the discriminative capacity is improved by applying a variety of regularizations, such as the spatial regularization in SRDCF (Danelljan et al., 2015b), the contextual regularization in BACF (Galoogahi et al., 2017b), and the temporal regularization in DeepSTRCF (Li et al., 2018a). The deep features help to significantly enhance the robustness of tracking, as shown by the deep trackers with different layers of features (e.g., CFNet-conv1, CFNet-conv2, and CFNet-conv5).

Enlarging the size of training samples in CFLB (Kiani Galoogahi et al., 2015) and the motion blur introduced in the data augmentation (Zhu et al., 2018a) have further improved the discriminative ability and robustness of the tracker to motion blur. Moreover, CFCF (Gundogdu and Alatan, 2018) achieved a competitive performance on the motion blur attribute in Figs. 22 and 23. We attribute the main reason for this improvement to the deep features from the VGG-m (Chatfield et al., 2014) model fine-tuned with 200k triplet samples collected from the ILSVRC (Russakovsky et al., 2015) dataset, whereas ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016) directly utilized the pre-trained VGG-m model (Chatfield et al., 2014). In addition, the deep trackers such as STMTrack (Fu et al., 2021) and TrDiMP (Wang et al., 2021b) perform much better than other trackers in terms of accuracy and robustness for the attribute of motion blur. Both trackers benefit from a more stable and accurate target model learned by the Transformer module from the memory of historical frames instead of a model learned from a single frame, as in SiamRPN++ (Li et al., 2019b) and SiamBAN (Chen et al., 2020).

9.4. Deformation

Different from occlusion, illumination, and motion blur, the target deformation occurs due to appearance changes, as in the Bird1 and Skating2-1 sequences shown in Fig. 30 and the fernando and gymnastics1 sequences shown in Fig. 31. Both low-level and high-level semantic features are important to adapt to appearance changes such as deformation. In early works, Locally Orderless Matching (i.e., LOT, Avidan et al., 2015), sparse representation (e.g., SCM, Zhong et al., 2012 and ASLA, Jia et al., 2012), and long-term memory (e.g., MUSTer, Hong et al., 2015) were proposed to enhance the robustness of the trackers. Recently, Zhu et al. (2018b) utilized a Siamese network that consists of historical and current branches, which is similar to maintaining a long-term memory model of the target appearance for handling severe deformation. ACFT (Ma et al., 2018) simultaneously maintained long-term and short-term memory to improve the robustness of the tracker.

In addition, the deformable attention in SiamAttn (Yu et al., 2020), the depth-wise cross-correlation and ResNet-driven Siamese network in SiamRPN++ (Li et al., 2019b), and the geometrically invariant model (GIM) in D3S (Lukežič et al., 2020) enable them to achieve superior performance compared with the conventional DCF trackers and shallow Siamese trackers, which is illustrated by the Rv and EAO scores in Table 4. In conclusion, existing deep trackers still need to improve their ability to handle deformation. Deeper and wider networks, large-scale datasets, and a variety of sample augmentation techniques can improve the performance of deep trackers when dealing with deformation.
Fig. 31. Qualitative examples: tracking results of the challenging sequences in VOT2018 on 8 video sequences, including bmx, butterfly, crabs1, fernando, fish1, frisbee, hand, and gymnastics1. The ground truth bounding boxes are in white.
9.5. Scale variation

The scale variation means that the size or aspect ratio of the target object changes dramatically due to appearance changes, viewpoint changes, or motion changes. Most CF based trackers estimate the target scale by a one-dimensional multi-scale searching method. However, the limitation of this strategy is that it only generates samples with their width and height scaled by the same scale factor a^r, where a and r are the scale increment factor and the scale level, respectively. GOTURN (Held et al., 2016) directly predicts the bounding box with a regressor implemented as multiple fully-connected layers. However, this tracker was learned under the assumption of motion smoothness. Later, in RPN based deep trackers, the dense sampling strategy with multiple scales and aspect ratios is used to generate diverse candidate proposals. By selecting the optimal bounding box with IoU-Net, the Precise RoI Pooling used in ATOM (Danelljan et al., 2019) enables us to iteratively refine the bounding box with the gradient ascent method; please refer to Jiang et al. (2018) for details. More recently, in anchor-free detectors, the width and height of the target object are predicted by a regression model to adapt to its shape variations. In addition, CGACD (Du et al., 2020) uses corner detection to adapt to the scale changes of the target. As shown in Fig. 26, SiamR-CNN (Voigtlaender et al., 2020), which employs Cascade Faster R-CNN (Cai and Vasconcelos, 2018) with a ResNet-101-FPN backbone, achieves the best performance under scale variation. Compared to Faster R-CNN (Ren et al., 2015), Cascade Faster R-CNN can sequentially improve the quality of the hypotheses and hence the accuracy of the tracking result.

9.6. Out-of-view

When the target partially or fully leaves the camera's field of view, most trackers are prone to lose the target and have difficulty re-acquiring it when it reappears. To address this problem, several trackers design a re-detection strategy to capture the target during long-term tracking, such as SiamR-CNN (Voigtlaender et al., 2020), DRL-IS (Ren et al., 2018), ACT (Chen et al., 2018), ACFT (Ma et al., 2018), MDNet (Nam and Han, 2016), and MUSTer (Hong et al., 2015). For this attribute, TrDiMP (Wang et al., 2021b) and SiamR-CNN (Voigtlaender et al., 2020) obtain the best success scores of 68.3% and 62.2% on the OTB-100 and LaSOT datasets, respectively. Most short-term tracking methods, such as ECO (Danelljan et al., 2017a), SiamFC (Bertinetto et al., 2016c), and SiamRPN++ (Li et al., 2019b), do not study the re-detection scheme in depth.

9.7. Background clutter

Background clutter can lead to the model drift problem, as in the Soccer sequence shown in Fig. 30. The performance of trackers in cluttered scenes has a big gap compared with their performance under other challenges, as shown in Fig. 26. The tracker TrDiMP (Wang et al., 2021b) obtains a much better success score on the LaSOT dataset for the Background Clutter attribute. The target model produced by the Transformer module with multiple historical feature maps exhibits more discriminative ability compared to DiMP50 (Bhat et al., 2019) and SiamRPN++ (Li et al., 2019b). Among DCF trackers, instead of a single sample cropped according to the previous location, multiple samples that are consecutively collected during tracking also contribute to the stability of the learned target model, as in CCOT (Danelljan et al., 2016) and ECO (Danelljan et al., 2017a).

9.8. Fast motion

Fast motion means that the motion of the target between two adjacent frames is larger than the size of the target. As shown in Fig. 26, the highest success score under the attribute of Fast Motion is 52.2%, which is the worst among all the attributes. Compared to the OTB-100 dataset, there are more video sequences in the LaSOT dataset with smaller and fast-moving targets. Most DCF or deep Siamese trackers employ a local search strategy, which cannot handle the fast motion challenge. The DCF tracker ECO (Danelljan et al., 2017a) only achieves a success score of 23.3% on the LaSOT dataset for fast motion, which illustrates that traditional trackers are more likely to lose the targets when they move fast. One approach to alleviate such problems is to enlarge the search region, as in DaSiamRPN (Zhu et al., 2018a) and Ocean (Zhang et al., 2020). However, enlarging the search region may bring more distractors around the target. In this case, we need to carefully select hyper-parameters such as the window influence in SiamRPN (Li et al., 2018c), because a large window influence would suppress the real target location when it is far from the center of the response map.

9.9. Rotation

The rotation includes in-plane rotation and out-of-plane rotation. The in-plane rotation means that a target rotates in the image plane, as in the MotorRolling and Skiing sequences shown in Fig. 30. The out-of-plane rotation means that the target rotates out of the image plane, such
as in the Skating1 sequence in Fig. 30. As shown in the success plots and precision plots of each attribute in Figs. 22 and 23, the accuracy of most trackers for out-of-plane rotation is lower than the results for in-plane rotation. For example, the top three trackers (e.g., STMTrack, SiamGAT, and SiamBAN) achieve success scores of 70.7%, 70.7%, and 68.7% in Fig. 22, where they achieve 2.3%, 3.2%, and 0.7% improvement for the attribute of in-plane rotation. The reason is that the appearance changes of a target that undergoes out-of-plane rotation are more severe than those for in-plane rotation. In addition, deep trackers exhibit significant superiority over the traditional trackers without deep features. To enhance the network capacity, data augmentations such as flip, rotation, shift, blur, and scale are adopted for offline training in most deep trackers.

9.10. Low resolution

Low resolution means that the number of pixels inside the target box is very small (e.g., fewer than 400 pixels in OTB-100). Thus, it is difficult to develop a discriminative appearance model for traditional trackers. Although employing multiple features and an effective model update strategy, the ECO (Danelljan et al., 2017a) tracker only obtains a success score of 61.7% on OTB-100 for the attribute of low resolution. The DiMP50 (Bhat et al., 2019) tracker obtains a success score of 59.5% and a precision score of 81.4% even with the target model prediction module, showing poor localization ability. This is in part due to the loss of information in deep neural networks under low resolution. The top trackers such as SiamR-CNN, SiamCAR, and SA-SIAM employ the feature pyramid or multiple layers of CNNs, and outperform TrDiMP and DiMP50 by a large margin on the OTB-100 dataset, as the success plots of the low-resolution attribute show in Fig. 22. However, DiMP50 (Bhat et al., 2019) still shows a high success score on LaSOT for the attribute of low resolution, compared to SiamBAN (Chen et al., 2020) and SiamCAR (Guo et al., 2020). A potential reason is that DiMP50 contains an effective model update strategy when tracking long-term video sequences.

9.11. Model update

To adapt to the target appearance changes, conventional DCF trackers (e.g., MOSSE, Bolme et al., 2010, DSST, Danelljan et al., 2017b and SRDCF, Danelljan et al., 2015a) usually update the model with an online rule (Danelljan et al., 2014). For example, CCOT (Danelljan et al., 2016) learned the filters iteratively when a new frame comes. However, the continuous model updating strategy is expensive and the model is sensitive to sudden appearance changes. Recently, ECO (Danelljan et al., 2017a), MDNet (Nam and Han, 2016), and MEEM (Zhang et al., 2014c) utilized a sparse update scheme to improve training efficiency and reduce the computational load. Specifically, MDNet (Nam and Han, 2016) exploited two different update intervals, including long-term and short-term updates. Because the frequency of model update affects both the tracking speed and the expressiveness of the model, most of the existing online update methods employ an empirical criterion. Afterward, the experimental results of ECO (Danelljan et al., 2017a) showed that infrequent model update can improve tracking performance. DeepSTRCF (Li et al., 2018a) employed an online passive–aggressive (PA) algorithm to learn the filters, which improves the robustness of the tracker. In addition to the continuous or fixed-interval update strategy, a new criterion, the peak-versus-noise ratio (PNR), was introduced by Zhu et al. (2018b) to control the model update at the right time. DiMP50 (Bhat et al., 2019) maintains a number of history samples for learning the prediction model, which is similar to the correlation filter learning process in ECO (Danelljan et al., 2017a), although this strategy decreases the tracking speed compared with other faster Siamese trackers. An alternative way is to learn an update strategy automatically by leveraging the merits of deep neural networks. Zhang et al. (2019a) proposed the UpdateNet to learn an adaptive update strategy, instead of the running average strategy with exponentially decaying weights over time. In GradNet (Li et al., 2019a), an optimization based meta-learner was used to update the template as a non-linear process, in which the discriminative information of gradients is computed to update the target feature. To alleviate the limitation of no online learning in SiamRPN (Li et al., 2018c) and the linear template update strategy in DaSiamRPN (Zhu et al., 2018a), ATOM (Danelljan et al., 2019) learns the classification model online to enhance the discriminative power of the classifier. In Han et al. (2021), the templates from the first frame and frame t-1 are fused with multi-head attention to produce the dynamic part representations, guided by a target mask that is generated from the ground truth bounding box.

Both Dai et al. (2020) and Zhang et al. (2021) combined the local search and global search schemes for solving the long-term tracking problem. Zhang et al. (2021) applied an online learned verification network (Jung et al., 2018) for identifying the target from the candidates detected by a regression network (RPN), while Dai et al. (2020) employed an RT-MDNet based verifier to further evaluate the results from the online-updated local tracker (e.g., ATOM and ECO). The confidence of the verifier controls whether the tracker switches to global search or continues to conduct local tracking in the next frame.

9.12. Motion model

To handle the partial occlusion, a motion model aims to describe the temporal correlation of the target states between consecutive frames. The motion models include the Gaussian model, sliding window, radius sliding window, cosine window, optical flow, temporal features, Region Proposal Network (RPN), Recurrent Neural Network (RNN), and action-decision process. In SINT+ (Tao et al., 2016), optical flow is used to filter out the motion-inconsistent candidates. Most tracking-by-detection trackers use the sliding window approach to generate candidate samples. Struck (Hare et al., 2016) generates the samples within a radius. The cosine window is a modification of the simple sliding window strategy. As a kind of motion model, the cosine window puts more emphasis near the center of the target. In addition, it can suppress background regions and alleviate the boundary discontinuity problem. In the particle filter framework, the motion model is implemented as the transition model p(x_t | x_{t-1}), where x_t and x_{t-1} denote the target states of frame t and frame t-1. Generally, the transition model p(x_t | x_{t-1}) is implemented as a Gaussian distribution p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Σ), where Σ is a diagonal covariance matrix whose elements are usually the corresponding variances of the affine parameters. Recently, MemTrack (Yang and Chan, 2018) and STMTrack (Fu et al., 2021) maintain a memory unit that contains multiple features of historical frames to adapt to target variations. ConvGRU (Bhat et al., 2020) is another approach to capture the correspondence relationship between two adjacent frames, which can be used to propagate the target states in consecutive frames. For the action-decision process, the trackers mainly learn to make decisions to search the target state with reinforcement learning. In Table 8, we present a detailed categorization of state-of-the-art trackers according to different motion models.

9.13. Hyper-parameter tuning

In most trackers, there is a set of hyper-parameters that needs to be tuned for different datasets, because the distributions of these tracking datasets differ. These hyper-parameters usually have significant impacts on tracking performance.

In SiamFC (Bertinetto et al., 2016c), the important hyper-parameters are numScale, scale-step, and scale-LR. A common approach for finding optimal hyper-parameters is random search with a uniform distribution over a reasonable range for each parameter. The more parameters, the more combinations we need to evaluate on the test sets.
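As a concrete illustration of two common motion priors discussed in Section 9.12 and categorized in Table 8 below, the following sketch (our own illustration, with hypothetical parameter values) shows a Gaussian transition model for particle propagation and a cosine (Hanning) window applied to a response map.

    import numpy as np

    def propagate_particles(states, sigmas):
        # Gaussian transition model p(x_t | x_{t-1}) = N(x_t; x_{t-1}, diag(sigmas^2)):
        # each particle state (e.g., x, y, scale) is perturbed independently.
        return states + np.random.randn(*states.shape) * sigmas

    def apply_cosine_window(response, window_influence=0.4):
        # The cosine window emphasizes locations near the previous target center and
        # suppresses large displacements; window_influence balances the two terms.
        h, w = response.shape
        window = np.outer(np.hanning(h), np.hanning(w))
        return (1.0 - window_influence) * response + window_influence * window

    # toy usage
    particles = np.zeros((100, 3))                              # (x, y, scale) per particle
    particles = propagate_particles(particles, np.array([4.0, 4.0, 0.02]))
    penalized = apply_cosine_window(np.random.rand(17, 17))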
Table 8
Trackers categorized by different motion models.
Motion model Trackers
Sliding window Struck (Hare et al., 2016)
Uniform sampling DCFST (Zheng et al., 2020)
Gaussian model CPF (Zhang et al., 2018c), MCPF (Zhang et al., 2017c), BranchOut (Han et al.,
2017), MMST (Chen et al., 2017b), MDNet (Nam and Han, 2016), FCNT (Wang
et al., 2015), SST (Zhang et al., 2015), ADNet (Yun et al., 2017), NR-MVDLSR (Kang
et al., 2019), GFS-DCF (Xu et al., 2019), RT-MDNet (Jung et al., 2018), ACT (Chen
et al., 2018), VITAL (Song et al., 2018), DLRT (Sui et al., 2018)
Cosine window MOSSE (Bolme et al., 2010), DSST (Danelljan et al., 2014), MemTrack (Yang and
Chan, 2018), SiamRN (Cheng et al., 2021), SiamGAT (Guo et al., 2021), PrDiMP50
(Danelljan et al., 2020), SiamRPN (Li et al., 2018c), SiamRPN++ (Li et al., 2019b),
DiMP50 (Bhat et al., 2019), MLT (Choi et al., 2019), SPM-Tracker (Wang et al.,
2019a), Retina/FCOS-MAML (Wang et al., 2020), SiamBAN (Chen et al., 2020),
CGACD (Du et al., 2020), BGDT (Huang et al., 2019), SACF (Zhang et al., 2018f),
SA-SIAM (He et al., 2018), CFCF (Gundogdu and Alatan, 2018), TRACA (Choi et al.,
2018), Ocean (Zhang et al., 2020), ASRCF (Dai et al., 2019), PAC (Zhang et al.,
2018a), SiamTri (Dong and Shen, 2018), CFNet (Valmadre et al., 2017), DRT (Sun
et al., 2018a), ACFT (Ma et al., 2018), StructSiam (Zhang et al., 2018e), CREST
(Song et al., 2017), CSR-DCF (Lukežič et al., 2017), EAST (Huang et al., 2017a),
DSiam (Guo et al., 2017), SiamFC (Bertinetto et al., 2016c), BACF (Galoogahi et al.,
2017b), CCOT (Danelljan et al., 2016), ECO (Danelljan et al., 2017a), SCT (Choi
et al., 2016), Staple (Bertinetto et al., 2016b), HCFT (Ma et al., 2015a), KCF
(Henriques et al., 2015), LCT (Ma et al., 2015c), SRDCF (Danelljan et al., 2015b)
Temporal features STMTrack (Fu et al., 2021), MemTrack (Yang and Chan, 2018), MMST (Chen et al.,
2017b), MUSTer (Hong et al., 2015), STNCF (Zhang et al., 2018b), UpdateNet
(Zhang et al., 2019a)
RNN HART (Kosiorek et al., 2017), KYS (Bhat et al., 2020)
Flow information FlowTrack (Zhu et al., 2018b), SINT+ (Tao et al., 2016)
Action-decision process ADNet (Yun et al., 2017), ACT (Chen et al., 2018), DRL-IS (Ren et al., 2018)
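The random search and the Optuna-based tuning described in this subsection can be sketched as follows. This is our own illustration assuming a recent Optuna version; evaluate_tracker is a hypothetical stand-in for running the tracker on a validation split and returning a score such as EAO or AUC.

    import optuna

    def evaluate_tracker(params):
        # hypothetical placeholder: in practice, run the tracker with these
        # hyper-parameters on a validation set and return its score.
        return -(params["window_influence"] - 0.4) ** 2

    def objective(trial):
        # the three SiamRPN-style hyper-parameters, each searched in [0, 1]
        params = {
            "lr": trial.suggest_float("lr", 0.0, 1.0),
            "window_influence": trial.suggest_float("window_influence", 0.0, 1.0),
            "penalty_k": trial.suggest_float("penalty_k", 0.0, 1.0),
        }
        return evaluate_tracker(params)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    print(study.best_params)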
than deep trackers, such as search area scale, number of scales, learning extraction, the strategy of the model update, and the hardware that the
rate for the online update, layers of deep neural networks used for algorithms are conducted on. The early DCF trackers with hand-crafted
feature representation, number of scales, scale step, parameters of the features such as MOSSE (Bolme et al., 2010) and KCF (Henriques
optimizer (e.g., regularization factor, iterations), spatial bandwidth of et al., 2015) have achieved comparable high speed, while the accuracy
2D Gaussian function, parameters of window function and so on. For and robustness still have a notable gap with state-of-the-art trackers.
most DCF trackers with a linear model update scheme, the learning rate By combining the deep and hand-crafted features, the accuracy and
has a significant impact on their tracking performance. robustness have been improved significantly, according to the results in
In SiamRPN (Li et al., 2018c) and SiamRPN++ (Li et al., 2019b), both Fig. 21 and Table 4. However, the speed is significantly decreased
there are three hyper-parameters include learning rate (lr), window if deep features are incorporated into conventional trackers such as
ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016), in
influence, and penalty_k, which are in the range of [0,1]. Learning
which the filters are learned by solving the optimization problem in
rate is used as a damping factor for location and scale update. Win-
high dimensional feature space.
dow_influence is settled on the window function to penalize large
Apart from the breakthroughs in accuracy and speed that benefit
displacements. Penalty_k is used to suppress large changes in size and
from the end-to-end deep learning architecture, such as
ratio. Apart from the above three hyper-parameters in SiamRPN (Li SiamFC (Bertinetto et al., 2016c), EAST (Huang et al., 2017a), CFNet
et al., 2018c), Ocean (Zhang et al., 2020) tracker also contains a weight (Valmadre et al., 2017), FlowTrack (Zhu et al., 2018b), SA-SIAM (He
𝜔 to combine the outputs of two different classification networks. et al., 2018), SiamRPN (Li et al., 2018c), DaSiamRPN (Zhu et al.,
There is a toolkit that can be used to tune the hyperparameters 2018a), SiamRPN++ (Li et al., 2019b), and SiamBAN (Chen et al.,
such as Optuna (Akiba et al., 2019) which was used in Ocean (Zhang 2020), most Siamese network based trackers can take full advantages
et al., 2020). Optuna can help researchers construct the parameter of GPU and achieve high speed performance. In order to dynamically
search space dynamically and provide pruning strategies. In addition to select appropriate feature layers, EAST (Huang et al., 2017a) speeds
employing Optuna to find the optimal hyper-parameters automatically, up tracking by learning an agent to decide whether forward to the
an alternative way is to find optimal hyper-parameters in two steps. next layer during target locating or not, which achieved 158.9 fps on
Firstly, we can construct a large search space with a fixed step size for GPU. Recently, Siamese trackers, including SiamRPN (Li et al., 2018c)
a parameter. After evaluating the trackers in the large search space, we and DaSiamRPN (Zhu et al., 2018a), improve the tracking speed sig-
can find a suboptimal and small search space. Secondly, we utilize the nificantly due to their efficient detection module such as RPN (Region
Optuna for optimal parameter searching. Proposal Network). The speed of DaSiamRPN (Zhu et al., 2018a) can
The TransT tracker (Chen et al., 2021) has only one hyper-parameter (window_influence): the output of its detection head is used directly as the tracking result, so the scale and displacement do not need to be penalized. In general, the fewer hyper-parameters a tracker contains, the easier it is to find the optimal model for a test dataset.

9.14. Speed

Tracking speed is an important metric for evaluating trackers, especially for the requirement of real-time processing in practical applications. The key factors that affect the tracking speed include feature extraction, the complexity of the model, and the hardware that the algorithms are conducted on. The early DCF trackers with hand-crafted features, such as MOSSE (Bolme et al., 2010) and KCF (Henriques et al., 2015), achieved comparably high speed, while their accuracy and robustness still show a notable gap to state-of-the-art trackers. By combining deep and hand-crafted features, the accuracy and robustness have been improved significantly, according to the results in both Fig. 21 and Table 4. However, the speed decreases significantly if deep features are incorporated into conventional trackers such as ECO (Danelljan et al., 2017a) and CCOT (Danelljan et al., 2016), in which the filters are learned by solving an optimization problem in a high-dimensional feature space.

Apart from the breakthroughs in accuracy and speed brought by end-to-end deep learning architectures such as SiamFC (Bertinetto et al., 2016c), EAST (Huang et al., 2017a), CFNet (Valmadre et al., 2017), FlowTrack (Zhu et al., 2018b), SA-SIAM (He et al., 2018), SiamRPN (Li et al., 2018c), DaSiamRPN (Zhu et al., 2018a), SiamRPN++ (Li et al., 2019b), and SiamBAN (Chen et al., 2020), most Siamese network based trackers can take full advantage of the GPU and achieve high-speed performance. In order to dynamically select appropriate feature layers, EAST (Huang et al., 2017a) speeds up tracking by learning an agent that decides whether to forward to the next layer during target localization, achieving 158.9 fps on a GPU. Recently, Siamese trackers such as SiamRPN (Li et al., 2018c) and DaSiamRPN (Zhu et al., 2018a) have improved the tracking speed significantly due to their efficient detection module, the RPN (Region Proposal Network); DaSiamRPN (Zhu et al., 2018a) reaches 160 fps with comparable accuracy and robustness. However, there is a tradeoff between employing complex neural networks and maintaining a fast tracking speed when designing a task-specific tracking model.
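To make the speed comparison concrete, the following minimal sketch (our illustration; the tracker.track(frame) interface is an assumption, not the API of any particular toolkit) shows how frames per second is typically measured, with a short warm-up to exclude one-off initialization costs such as model loading or GPU start-up.

```python
import time

def measure_fps(tracker, frames, warmup=10):
    """Return the average tracking speed in frames per second."""
    for frame in frames[:warmup]:
        tracker.track(frame)          # warm-up iterations are not timed
    start = time.perf_counter()
    for frame in frames[warmup:]:
        tracker.track(frame)          # one tracking step per frame
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed

# Hypothetical usage:
# fps = measure_fps(my_tracker, video_frames)
# print(f"{fps:.1f} fps")
```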
10. Future directions and open issues

Visual object tracking has been promoted by a variety of factors, including large-scale tracking datasets, high-capacity backbone networks, training methods, model update schemes, object detection technologies, and so on. In this section, we present future research directions and open issues for visual object tracking.
• Lightweight Model Exploring

The backbone networks are important for the representation of a target object. In visual object tracking, we have witnessed many kinds of neural networks designed for image recognition tasks, including AlexNet, VGGNet-M, VGGNet-16, VGGNet-19, GoogLeNet, ResNet-18, ResNet-50, and their variations. On the one hand, deeper and wider neural networks can enhance model performance; on the other hand, they bring a heavier computation workload and a larger memory footprint. For real-world applications such as mobile devices and industrial robotics, where the model size and compute resources are constrained, we need to construct lightweight neural networks with fewer parameters and FLOPs while maintaining high tracking performance. Neural Architecture Search (NAS) (Yan et al., 2021) has shown great appeal and advantages in deep learning, providing an effective way to search for an optimal lightweight model by constraining the parameters and FLOPs. Therefore, it will be another promising research direction in visual object tracking.
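As a small, hedged illustration of constraining the model size (FLOP counting needs an external profiler and is omitted here), the snippet below counts the trainable parameters of a candidate backbone and checks them against a budget; the ResNet-18 candidate and the 5M-parameter budget are arbitrary examples rather than values from the surveyed work.

```python
import torch.nn as nn
from torchvision.models import resnet18

def count_params(model: nn.Module) -> int:
    # Total number of trainable parameters of the candidate network.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def within_budget(model: nn.Module, max_params: int = 5_000_000) -> bool:
    # A NAS-style search could use such a check to discard oversized candidates.
    return count_params(model) <= max_params

backbone = resnet18()
print(count_params(backbone))   # roughly 11.7M parameters
print(within_budget(backbone))  # False: exceeds the 5M budget in this example
```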
• Model Update

To adapt to appearance changes of the target object, many matching based trackers incrementally update their templates, and most CF based trackers linearly update their model at each frame. For deep trackers, meta learning can be applied to update the model on-the-fly with samples collected from the current and previous frames. Dai et al. (2020) learn a Meta-Updater built on a cascaded LSTM that takes geometric, discriminative, and appearance cues as input; this Meta-Updater determines whether the local tracker or the verifier should be updated based on the current tracking state. Beyond the single template used by most Siamese trackers, the space–time memory network (Fu et al., 2021), which maintains multiple historical frames, provides rich appearance information of the target for building the target model. Therefore, employing temporal features to update the target model online could be another attractive research direction for tracking in long video sequences.
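As a minimal sketch of the linear update rule mentioned above (an exponential moving average), not tied to any specific tracker, the model is blended with the newly estimated one using a learning rate; the estimate_model helper in the usage comment is hypothetical.

```python
import numpy as np

def linear_update(model: np.ndarray, new_model: np.ndarray, lr: float = 0.01) -> np.ndarray:
    # Linear (exponential moving average) update used by many DCF trackers:
    # a small lr keeps the model stable, a large lr adapts faster but drifts more easily.
    return (1.0 - lr) * model + lr * new_model

# Hypothetical usage inside a tracking loop:
# model = init_filter(first_frame)
# for frame in frames:
#     new_model = estimate_model(frame)            # assumed helper
#     model = linear_update(model, new_model, 0.02)
```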
• Combining the Local and Global Search Strategy

Transformer approaches have recently become popular in many kinds of vision tasks, such as image recognition, object detection, and segmentation. Several attempts have been made to employ the Transformer inside or outside the backbone network to improve tracking performance; please refer to Section 5.1.5. However, pure attention models such as SASA (Ramachandran et al., 2019), BoTNet (Srinivas et al., 2021), ViT (Sharir et al., 2021), and the Pyramid Vision Transformer (PVT) (Wang et al., 2021a) have not yet been explored in combination with trackers for visual tracking. Benefiting from their ability to learn long-range dependencies, self-attention (applied within one input branch) and cross-attention (between the template and search branches) enable information to interact in a global manner, which is different from the convolution/correlation operation. Such a scheme can improve the robustness of the tracker in challenging scenarios such as partial occlusion or large-scale deformation. Therefore, combining Transformers with the Siamese network will be the next evolution to facilitate tracking research.

On the one hand, the Transformer can be utilized to fuse template features (from either a single or multiple history frames) with search features; on the other hand, how to combine local search via convolution with global search via the Transformer is still an open issue.
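To illustrate the cross-attention scheme described above, the following sketch (our own illustration, not the implementation of TransT or any other published tracker) matches flattened template and search-region features with torch.nn.MultiheadAttention; the feature dimension, map sizes, and residual design are assumptions.

```python
import torch
import torch.nn as nn

class TemplateSearchCrossAttention(nn.Module):
    """Global matching between template and search-region features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat:   (B, H*W, C) flattened search-region features (queries)
        # template_feat: (B, h*w, C) flattened template features (keys/values)
        attn_out, _ = self.cross_attn(search_feat, template_feat, template_feat)
        # Residual connection plus normalization, as in standard Transformer blocks.
        return self.norm(search_feat + attn_out)

# Example with random features: a 31x31 search map and a 15x15 template map.
search = torch.randn(1, 31 * 31, 256)
template = torch.randn(1, 15 * 15, 256)
fused = TemplateSearchCrossAttention()(search, template)
print(fused.shape)  # torch.Size([1, 961, 256])
```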
• Tracking with Multi-modal Information

Existing tracking algorithms mainly search for the target with visual information. In other words, they localize the target by developing a model that estimates the similarity between the target and the search region, while the target model does not establish the relationship between the target and its scene. When similar objects appear in the scene, even a deep tracker is easily confused and may lose the target. In KYS (Bhat et al., 2020), scene information is represented as dense localized state vectors that are propagated through the sequence; these vectors provide additional information to the appearance model for tracking. Besides the images and the ground-truth bounding boxes, the LaSOT dataset also provides an additional language specification for every sequence. Even a short description can provide very useful information about the target and its scene, including the color, behavior, and surroundings of the target in the whole sequence. Exploring such language information combined with the visual information for tracking is still an open issue; the semantic information from the language specification can help the tracker locate the target in complex scenes.

11. Conclusions

In this paper, we presented a survey of traditional and deep methods for visual object tracking. We provided both quantitative and qualitative tracking results on multiple benchmarks. We analyzed the advantages and disadvantages of state-of-the-art single object tracking algorithms in detail and made a comprehensive analysis of their performance on five tracking datasets. The generative trackers have advantages for handling challenging scenarios such as occlusion and large-scale variation via the particle sampling strategy, and they can be combined with different appearance models (e.g., sparse representation, subspace representation, and motion energy). With the tracking-by-detection scheme, many trackers attempted to build powerful discriminative classifiers with hand-crafted or deep features. The analysis from previous sections indicates that an appropriate motion model (summarized in Table 8) helps to improve the capacity of both the generative and discriminative trackers.

The experimental results illustrated that recently proposed deep trackers achieved superior performance on public tracking datasets in terms of accuracy, robustness, and speed, owing to powerful feature extractors, accurate bounding box regressors, discriminative classifiers, fully-convolutional networks, and so on. In addition to the conventional convolution or correlation operations (local matching) used for information interaction in deep Siamese network trackers, deformable convolution and the Transformer have been extended to perform feature matching in a global manner. In practice, these two schemes can work together in a complementary way to further improve the capacity of the tracker, and how to combine the two kinds of operations in tracking is still an open issue for further study.

Besides the detailed overview of the literature on state-of-the-art trackers, we gave a summary of these trackers with different attributes in Table 7 to provide another comparison. We analyzed different tracking scenarios such as occlusion, deformation, scale variation, model update, and hyper-parameter tuning in detail to further understand the characteristics of the trackers. At the end of this survey, we gave suggestions for open issues and listed potential future directions in visual object tracking.

CRediT authorship contribution statement

Fei Chen: Conceptualization, Methodology, Formal analysis, Validation, Writing – original draft, Visualization. Xiaodong Wang: Funding acquisition, Methodology, Resources, Supervision. Yunxiang Zhao: Writing – review & editing, Visualization. Shaohe Lv: Conceptualization, Investigation. Xin Niu: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments

This work was supported by the Science and Technology Foundation of State Key Laboratory of Parallel and Distributed Processing, China under Grant 6142110180405.

References

Chen, Z., You, X., Zhong, B., Li, J., Tao, D., 2017b. Dynamically modulated mask sparse tracking. IEEE Trans. Cybern. 47, 3706–3718.
Chen, Z., Zhong, B., Li, G., Zhang, S., Ji, R., 2020. Siamese box adaptive network for visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6668–6677.
Cheng, S., Zhong, B., Li, G., Liu, X., Tang, Z., Li, X., Wang, J., 2021. Learning to filter: Siamese relation network for robust tracking. arXiv:2104.00829.
Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y., 2018. Context-aware deep feature compression for high-speed visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 479–488.
Adelson, E.H., Bergen, J.R., 1985. Spatiotemporal energy models for the perception of Choi, J., Chang, H.J., Yun, S., Fischer, T., Demiris, Y., Choi, J.Y., et al., 2017. Atten-
motion. J. Opt. Soc. Am. A 2, 284–299. tional correlation filter network for adaptive visual tracking. In: IEEE Conference
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M., 2019. Optuna: A next-generation on Computer Vision and Pattern Recognition, Vol. 2, p. 7.
hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD Choi, J., Jin Chang, H., Jeong, J., Demiris, Y., Young Choi, J., 2016. Visual tracking
International Conference on Knowledge Discovery and Data Mining. pp. 2623–2631. using attention-modulated disintegration and integration. In: IEEE Conference on
Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T., 2002. A tutorial on particle Computer Vision and Pattern Recognition. pp. 4321–4330.
filters for online nonlinear/non-gaussian Bayesian tracking. IEEE Trans. Signal Choi, J., Kwon, J., Lee, K.M., 2019. Deep meta learning for real-time target-aware visual
Process. 50, 174–188. tracking. In: IEEE International Conference on Computer Vision. pp. 911–920.
Avidan, S., 2007. Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell. 29, Corbetta, M., Shulman, G.L., 2002. Control of goal-directed and stimulus-driven
261–271. attention in the brain. Nat. Rev. Neurosci. 3, 201–215.
Avidan, S., Levi, D., Barhillel, A., Oron, S., 2015. Locally orderless tracking. Int. J. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.,
Comput. Vis. 111, 213–228. 2018. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35,
Babenko, B., Yang, M.-H., Belongie, S., 2011. Robust object tracking with online 53–65.
multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1619–1632. Cui, Z., Cai, Y., Zheng, W., Xu, C., Yang, J., 2019. Spectral filter tracking. IEEE Trans.
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning Image Process. 28, 2479–2489.
to align and translate. arXiv:1409.0473. Cui, Z., Xiao, S., Feng, J., Yan, S., 2016. Recurrently target-attending tracking. In: IEEE
Baker, S., Matthews, I., 2004. Lucas-kanade 20 years on: A unifying framework. Int. J. Conference on Computer Vision and Pattern Recognition. pp. 1449–1458.
Comput. Vis. 56, 221–255. Dai, K., Wang, D., Lu, H., Sun, C., Li, J., 2019. Visual tracking via adaptive spatially-
Bertinetto, L., Henriques, J.F., Valmadre, J., Torr, P., Vedaldi, A., 2016a. Learning feed- regularized correlation filters. In: IEEE Conference on Computer Vision and Pattern
forward one-shot learners. In: Advances in Neural Information Processing Systems. Recognition. pp. 4670–4679.
pp. 523–531. Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X., 2020. High-performance long-term
Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S., 2016b. Staple: tracking with meta-updater. In: IEEE Conference on Computer Vision and Pattern
Complementary learners for real-time tracking. In: IEEE Conference on Computer Recognition. pp. 6297–6306.
Vision and Pattern Recognition. pp. 1401–1409. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M., 2017a. ECO: Efficient convolution
Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S., 2016c. Fully- operators for tracking. In: IEEE Conference OnComputer Vision and Pattern
convolutional Siamese networks for object tracking. In: European Conference on Recognition. pp. 6931–6939.
Computer Vision. pp. 850–865. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M., 2019. ATOM: Accurate tracking
Bhat, G., Danelljan, M., Gool, L.V., Timofte, R., 2020. Know your surroundings: by overlap maximization. In: IEEE Conference on Computer Vision and Pattern
Exploiting scene information for object tracking. In: European Conference on Recognition. pp. 4660–4669.
Computer Vision. pp. 205–221. Danelljan, M., Häger, G., Khan, F., Felsberg, M., 2014. Accurate scale estimation
Bhat, G., Danelljan, M., Van Gool, L., Timofte, R., 2019. Learning discriminative model for robust visual tracking. In: British Machine Vision Conference, Nottingham,
prediction for tracking. In: IEEE International Conference on Computer Vision. pp. September 1-5, 2014. BMVA Press.
6182–6191. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M., 2015a. Convolutional features
Bhat, G., Johnander, J., Danelljan, M., Shahbaz Khan, F., Felsberg, M., 2018. Unveiling for correlation filter based visual tracking. In: IEEE International Conference on
the power of deep tracking. In: European Conference on Computer Vision. pp. Computer Vision Workshop. pp. 621–629.
483–498. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M., 2017b. Discriminative scale space
Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M., 2010. Visual object tracking using tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1561–1575.
adaptive correlation filters. In: IEEE Conference on Computer Vision and Pattern Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M., 2015b. Learning spatially
Recognition. pp. 2544–2550. regularized correlation filters for visual tracking. In: IEEE International Conference
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed optimization on Computer Vision. pp. 4310–4318.
and statistical learning via the alternating direction method of multipliers. Found. Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M., 2016. Beyond correlation
Trends Mach. Learn. 3, 1–122. filters: Learning continuous convolution operators for visual tracking. In: European
Briechle, K., Hanebeck, U.D., 2001. Template matching using fast normalized cross Conference on Computer Vision. pp. 472–488.
correlation. In: Optical Pattern Recognition XII, vol. 4387, pp. 95–103. Danelljan, M., Van Gool, L., Timofte, R., 2020. Probabilistic regression for visual
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R., 1994. Signature verification tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp.
using a "Siamese" time delay neural network. In: Advances in Neural Information 7183–7192.
Processing Systems. pp. 737–744. Dekel, T., Oron, S., Rubinstein, M., Avidan, S., Freeman, W.T., 2015. Best-buddies
Cai, Z., Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object similarity for robust template matching. In: 2015 IEEE Conference on Computer
detection. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. Vision and Pattern Recognition. pp. 2021–2029.
6154–6162. Dollár, P., Appel, R., Belongie, S., Perona, P., 2014. Fast feature pyramids for object
Caicedo, J.C., Lazebnik, S., 2015. Active object localization with deep reinforcement detection. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1532–1545.
learning. In: IEEE International Conference on Computer Vision. pp. 2488–2496. Dong, X., Shen, J., 2018. Triplet loss in siamese network for object tracking. In:
Cannons, K., Gryn, J.M., Wildes, R.P., 2010. Visual tracking using a pixelwise spa- European Conference on Computer Vision. pp. 459–474.
tiotemporal oriented energy representation. In: European Conference on Computer Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van
Vision. pp. 511–524. Der Smagt, P., et al., 2015. Flownet: Learning optical flow with convolutional
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End- networks. In: IEEE International Conference on Computer Vision. pp. 2758–2766.
to-End object detection with transformers. In: European Conference on Computer Doucet, A., De Freitas, N., Gordon, N., 2001. An introduction to sequential Monte Carlo
Vision. pp. 213–229. methods. In: Sequential Monte Carlo Methods in Practice. pp. 3–14.
Čehovin, L., Leonardis, A., Kristan, M., 2016. Visual object tracking performance Dredze, M., Kulesza, A., Crammer, K., 2010. Multi-domain learning by confidence-
measures revisited. IEEE Trans. Image Process. 25, 1261–1274. weighted parameter combination. Mach. Learn. 79, 123–149.
Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A., 2014. Return of the devil Du, F., Liu, P., Zhao, W., Tang, X., 2020. Correlation-guided attention for corner
in the Details: Delving deep into convolutional nets. In: British Machine Vision detection based visual tracking. In: IEEE Conference on Computer Vision and
Conference. Pattern Recognition. pp. 6836–6845.
Chen, B., Li, P., Sun, C., et al., 2019. Multi attention module for visual tracking. Pattern Duan, L., Tsang, I.W., Xu, D., Chua, T.-S., 2009. Domain adaptation from multiple
Recognit. 87, 80–93. sources via auxiliary classifiers. In: International Conference on Machine Learning.
Chen, Z., Luo, L., Huang, D., Wen, M., Zhang, C., 2017a. Exploiting a depth context pp. 289–296.
model in visual tracking with correlation filter. Front. Inf. Technol. Electron. Eng. Fan, H., Ling, H., 2017a. Parallel tracking and verifying: A framework for real-time
18, 667–679. and high accuracy visual tracking. In: IEEE International Conference on Computer
Chen, B., Wang, D., Li, P., Wang, S., Lu, H., 2018. Real-time’Actor-Critic’Tracking. In: Vision. pp. 5486–5494.
European Conference on Computer Vision. pp. 318–334. Fan, H., Ling, H., 2017b. SANet: Structure-aware network for visual tracking. In:
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H., 2021. Transformer tracking. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp.
IEEE Conference on Computer Vision and Pattern Recognition. pp. 8126–8135. 2217–2224.
Fan, H., Ling, H., 2019. Siamese cascaded region proposal networks for real-time visual Huang, R., Zhang, S., Li, T., He, R., 2017. Beyond face rotation: Global and local
tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. perception GAN for photorealistic and identity preserving frontal view synthesis.
7952–7961. In: IEEE International Conference on Computer Vision. pp. 2439–2448.
Fan, H., Ling, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Huang, L., Zhao, X., Huang, K., 2018. GOT-10k: A large high-diversity benchmark for
0000. LaSOT Evaluation Toolkit, https://github.com/HengLan/LaSOT_Evaluation_ generic object tracking in the wild. arXiv:1810.11981.
Toolkit. Huang, L., Zhao, X., Huang, K., 2019. Bridging the gap between detection and tracking:
Fan, H., Ling, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., A unified approach. In: IEEE International Conference on Computer Vision. pp.
2019. LaSOT: A high-quality benchmark for large-scale single object tracking. In: 3999–4009.
IEEE Conference on Computer Vision and Pattern Recognition. pp. 5374–5383. Isard, M., Blake, A., 1998. Condensation—Conditional density propagation for visual
Fan, J., Song, H., Zhang, K., Liu, Q., Lian, W., 2018. Complementary tracking via dual tracking. Int. J. Comput. Vis. 29, 5–28.
color clustering and spatio-temporal regularized correlation learning. IEEE Access Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with
6, 56526–56538. conditional adversarial networks. In: IEEE Conference on Computer Vision and
Fiaz, M., Mahmood, A., Javed, S., Jung, S.K., 2019. Handcrafted and deep trackers: Pattern Recognition. pp. 5967–5976.
Recent visual object tracking approaches and trends. ACM Comput. Surv. 52, 1–44. Jaderberg, M., Simonyan, K., Zisserman, A., et al., 2015. Spatial transformer networks.
Finn, C., Abbeel, P., Levine, S., 2017. Model-agnostic meta-learning for fast adaptation In: Advances in Neural Information Processing Systems. pp. 2017–2025.
of deep networks. In: IEEE International Conference on Machine Learning. pp. Ji, H., Ling, H., Wu, Y., Bao, C., 2012. Real time robust L1 tracker using accelerated
1126–1135. proximal gradient approach. In: IEEE Conference on Computer Vision and Pattern
Fu, Z., Liu, Q., Fu, Z., Wang, Y., 2021. Stmtrack: Template-free visual tracking with Recognition. pp. 1830–1837.
space-time memory networks. arXiv:2104.00324. Jia, X., Lu, H., Yang, M.-H., 2012. Visual tracking via adaptive structural local
Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S., 2017a. Need for speed: A sparse appearance model. In: IEEE Conference on Computer Vision and Pattern
benchmark for higher frame rate object tracking. In: IEEE International Conference Recognition. pp. 1822–1829.
on Computer Vision. pp. 1134–1143. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y., 2018. Acquisition of localization
Galoogahi, H.K., Fagg, A., Lucey, S., 2017b. Learning background-aware correlation confidence for accurate object detection. In: European Conference on Computer
filters for visual tracking. In: IEEE Conference on Computer Vision and Pattern Vision. Springer International Publishing, pp. 816–832.
Recognition. pp. 21–26. Jung, I., Son, J., Baek, M., Han, B., 2018. Real-time MDNet. In: European Conference
Gavves, E., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P., Tao, R., Valmadre, J., on Computer Vision. pp. 83–98.
2018. Long-term tracking in the wild: A benchmark. In: European Conference on Kalal, Z., Mikolajczyk, K., et al., 2011. Tracking-learning-detection. IEEE Trans. Pattern
Computer Vision. pp. 670–685. Anal. Mach. Intell. 34, 1409–1422.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for Kang, B., Zhu, W.-P., Liang, D., Chen, M., 2019. Robust visual tracking via nonlocal
accurate object detection and semantic segmentation. In: IEEE Conference on regularized multi-view sparse representation. Pattern Recognit. 88, 75–89.
Computer Vision and Pattern Recognition. pp. 580–587. Khan, Z., Balch, T., Dellaert, F., 2004. A rao-blackwellized particle filter for
Gundogdu, E., Alatan, A.A., 2018. Good features to correlate for visual tracking. IEEE Eigentracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Trans. Image Process. 27, 2526–2540. Kiani Galoogahi, H., Sim, T., Lucey, S., 2013. Multi-channel correlation filters. In: IEEE
International Conference on Computer Vision. pp. 3072–3079.
Guo, Q., Feng, W., Zhou, C., Huang, R., Wan, L., Wang, S., 2017. Learning dynamic
Kiani Galoogahi, H., Sim, T., Lucey, S., 2015. Correlation filters with limited boundaries.
siamese network for visual object tracking. In: IEEE International Conference on
In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4630–4638.
Computer Vision. pp. 1781–1789.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., Shen, C., 2021. Graph attention tracking.
arXiv:2011.11204.
Konda, V.R., Tsitsiklis, J.N., 2000. Actor-critic algorithms. In: Advances in Neural
Guo, D., Wang, J., Cui, Y., Wang, Z., Chen, S., 2020. SiamCAR: Siamese fully
Information Processing Systems. pp. 1008–1014.
convolutional classification and regression for visual tracking. In: IEEE Conference
Kosiorek, A., Bewley, A., Posner, I., 2017. Hierarchical attentive recurrent tracking. In:
on Computer Vision and Pattern Recognition. pp. 6269–6277.
Advances in Neural Information Processing Systems. pp. 3053–3061.
Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking with parametric models of
Kristan, M., Eldesokey, A., et al., 2017. The visual object tracking VOT2017 challenge
geometry and illumination. IEEE Trans. Pattern Anal. Mach. Intell. 20, 1025–1039.
results. In: IEEE International Conference on Computer Vision Workshop. pp.
Han, W., Huang, H., Yu, X., 2021. TAPL: Dynamic part-based visual tracking via
1949–1972.
attention-guided part localization. In: British Machine Vision Conference.
Kristan, M., Leonardis, A., Matas, J., et al., 2018. The sixth visual object tracking
Han, B., Sim, J., Adam, H., 2017. Branchout: Regularization for online ensemble
VOT2018 challenge results. In: European Conference on Computer Vision.
tracking with convolutional neural networks. In: IEEE International Conference on
Kristan, M., Leonardis, A., et al., 2016. The visual object tracking VOT2016 challenge
Computer Vision. pp. 2217–2224.
results. In: European Conference on Computer Vision Workshops, Vol. 8926, pp.
Hare, S., Golodetz, S., Saffari, A., Vineet, V., Cheng, M.-M., Hicks, S.L., Torr, P.H.,
191–217.
2016. Struck: Structured output tracking with kernels. IEEE Trans. Pattern Anal.
Kristan, M., Matas, J., Leonardis, A., et al., 2015. The visual object tracking VOT2015
Mach. Intell. 38, 2096–2109.
challenge results. In: IEEE International Conference on Computer Vision Workshops.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: IEEE International
pp. 1–23.
Conference on Computer Vision. pp. 2961–2969.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep
He, A., Luo, C., Tian, X., Zeng, W., 2018. A twofold Siamese network for real-time convolutional neural networks. In: Advances in Neural Information Processing
object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. Systems. pp. 1097–1105.
pp. 4834–4843. Kwon, J., Lee, K.M., Park, F.C., 2009. Visual tracking via geometric particle filtering
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. on the affine group with optimal importance functions. In: IEEE Conference on
In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778. Computer Vision and Pattern Recognition. pp. 991–998.
Held, D., Thrun, S., Savarese, S., 2016. Learning to track at 100 Fps with deep Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., et al., 2017. Photo-
regression networks. In: European Conference on Computer Vision. pp. 749–765. realistic single image super-resolution using a generative adversarial network. In:
Henriques, J.o.F., Caseiro, R., Martins, P., Batista, J., 2012. Exploiting the circu- IEEE Conference on Computer Vision and Pattern Recognition. pp. 4681–4690.
lant structure of tracking-by-detection with kernels. In: European Conference on Li, P., Chen, B., Ouyang, W., Wang, D., Yang, X., Lu, H., 2019a. GradNet: Gradient-
Computer Vision. pp. 702–715. guided network for visual object tracking. In: IEEE International Conference on
Henriques, J.F., Caseiro, R., Martins, P., Batista, J., 2015. High-speed tracking with Computer Vision. pp. 6162–6171.
kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37, 583–596. Li, A., Lin, M., Wu, Y., Yang, M.H., Yan, S., 2016a. NUS-PRO: A new visual tracking
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, challenge. IEEE Trans. Pattern Anal. Mach. Intell. 38, 335–349.
1735–1780. Li, X., Shen, C., Dick, A., Zhang, Z.M., Zhuang, Y., 2016b. Online metric-weighted
Hong, Z., Chen, Z., Wang, C., Mei, X., Prokhorov, D., Tao, D., 2015. MUlti-store Tracker linear representations for robust visual tracking. IEEE Trans. Pattern Anal. Mach.
(MUSTer): A cognitive psychology inspired approach to object tracking. In: IEEE Intell. 38, 931–950.
Conference on Computer Vision and Pattern Recognition. pp. 749–758. Li, M., Tan, T., Chen, W., Huang, K., 2012. Efficient object tracking by incremental
Hong, Z., Mei, X., Prokhorov, D., Tao, D., 2013. Tracking via robust multi-task multi- self-tuning particle filtering on the affine group. IEEE Trans. Image Process. 21,
view joint sparse representation. In: IEEE International Conference on Computer 1298–1313.
Vision. pp. 649–656. Li, F., Tian, C., Zuo, W., Zhang, L., Yang, M.-H., 2018a. Learning spatial-temporal
Horn, B.K.P., Schunck, B.G., 1981. Determining optical flow. Artificial Intelligence 17, regularized correlation filters for visual tracking. In: IEEE Conference on Computer
185–203. Vision and Pattern Recognition. pp. 4904–4913.
Hua, Y., Alahari, K., Schmid, C., 2015. Online object tracking with proposal selection. Li, P., Wang, D., Wang, L., Lu, H., 2018b. Deep visual tracking: Review and
In: IEEE International Conference on Computer Vision. pp. 3092–3100. experimental comparison. Pattern Recognit. 76, 323–338.
Huang, C., Lucey, S., Ramanan, D., 2017. Learning policies for adaptive tracking with Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J., 2019b. Siamrpn++: Evolution of
deep feature cascades. In: IEEE International Conference on Computer Vision. pp. Siamese visual tracking with very deep networks. In: IEEE Conference on Computer
105–114. Vision and Pattern Recognition. pp. 4282–4291.
Li, B., Xie, W., Zeng, W., Liu, W., 2019c. Learning to update for object tracking with Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, U., Shazeer, N., Ku, A., Tran, D., 2018.
recurrent meta-learner. IEEE Trans. Image Process. 28, 3624–3635. Image transformer. In: International Conference on Machine Learning.
Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X., 2018c. High performance visual tracking with Qi, Y., Zhang, S., Qin, L., Yao, H., Huang, Q., Lim, J., Yang, M.-H., 2016. Hedged
Siamese region proposal network. In: IEEE Conference on Computer Vision and deep tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Pattern Recognition. pp. 8971–8980. pp. 4303–4311.
Li, Y., Zhu, J., 2014. A scale adaptive kernel correlation filter tracker with feature Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J., 2019.
integration. In: European Conference on Computer Vision. pp. 254–265. Stand-alone self-attention in vision models. arXiv:1906.05909.
Liang, P., Blasch, E., Ling, H., 2015. Encoding color information for visual tracking: Real, E., Shlens, J., Mazzocchi, S., et al., 2017. YouTube-BoundingBoxes: A large
Algorithms and benchmark. IEEE Trans. Image Process. 24, 5630–5644. high-precision human-annotated data set for object detection in Video. In: IEEE
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object Conference on Computer Vision and Pattern Recognition. pp. 5296–5305.
detection. IEEE Trans. Pattern Anal. Mach. Intell. PP, 2999–3007. Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., detection with region proposal networks. In: Advances in Neural Information
Zitnick, C.L., 2014. Microsoft Coco: Common objects in context. In: European Processing Systems. pp. 91–99.
Conference on Computer Vision. pp. 740–755. Ren, L., Yuan, X., Lu, J., Yang, M., Zhou, J., 2018. Deep reinforcement learning with
Liu, F., Gong, C., Huang, X., Zhou, T., Yang, J., Tao, D., 2018. Robust visual tracking iterative shift for visual tracking. In: European Conference on Computer Vision. pp.
revisited: From correlation filter to template matching. IEEE Trans. Image Process. 684–700.
27, 2777–2790. Ross, D.A., Lim, J., et al., 2008. Incremental learning for robust visual tracking. Int. J.
Liu, T., Wang, G., Yang, Q., 2015. Real-time part-based visual tracking via adaptive Comput. Vis. 77, 125–141.
correlation filters. In: IEEE Conference on Computer Vision and Pattern Recognition. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
pp. 4902–4912. Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Int. J. recognition challenge. Int. J. Comput. Vis. 115, 211–252.
Comput. Vis. 60, 91–110. Sharir, G., Noy, A., Zelnik-Manor, L., 2021. An image is worth 16x16 words, What is
Lu, X., Ma, C., Ni, B., Yang, X., Reid, I., Yang, M.-H., 2018. Deep regression tracking a Video worth? arXiv:2103.13915.
with shrinkage loss. In: European Conference on Computer Vision. pp. 353–369. Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M., 2014.
Lukežič, A., Vojíř, T., Zajc, L.Č., Matas, J., Kristan, M., 2017. Discriminative correlation Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell. 36,
filter with channel and spatial reliability. In: IEEE Conference on Computer Vision 1442–1468.
and Pattern Recognition. pp. 4847–4856. Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W., Yang, M.-H., 2017. Crest: Convolutional
Lukežič, A., Matas, J., Kristan, M., 2020. D3S-A discriminative single shot segmentation residual learning for visual tracking. In: IEEE International Conference on Computer
tracker. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. Vision. pp. 2574–2583.
7133–7142. Song, Y., Ma, C., Wu, X., Gong, L., Bao, L., Zuo, W., Shen, C., Lau, R.W.H., Yang, M.-
Lukežič, A., Zajc, L.Č., Vojíř, T., Matas, J., Kristan, M., 2021. Performance evaluation
methodology for long-term single-object tracking. IEEE Trans. Cybern. 51, Computer Vision and Pattern Recognition. pp. 8990–8999.
6305–6318.
Song, H., Zheng, Y., Zhang, K., 2016. Robust visual tracking via self-similarity learning.
Lukežič, A., Zajc, L.Č., Vojíř, T., Matas, J., Kristan, M., 2018. Now you see me:
Electron. Lett. 53, 20–22.
Evaluating performance in long-term visual tracking. arXiv:1804.07056.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A., 2021. Bottleneck
Ma, C., Huang, J.-B., Yang, X., Yang, M.-H., 2015a. Hierarchical convolutional features
transformers for visual recognition. arXiv:2101.11605.
for visual tracking. In: IEEE International Conference on Computer Vision. pp.
Sui, Y., Tang, Y., Zhang, L., 2015. Discriminative low-rank tracking. In: IEEE
3074–3082.
International Conference on Computer Vision. pp. 3002–3010.
Ma, C., Huang, J.-B., Yang, X., Yang, M.-H., 2018. Adaptive correlation filters with
Sui, Y., Tang, Y., Zhang, L., Wang, G., 2018. Visual tracking via subspace learning: A
long-term and short-term memory for object tracking. Int. J. Comput. Vis. 1–26.
discriminative approach. Int. J. Comput. Vis. 126, 515–536.
Ma, L., Lu, J., Feng, J., Zhou, J., 2015b. Multiple feature fusion via weighted entropy
Sun, C., Wang, D., Lu, H., Yang, M.-H., 2018a. Correlation tracking via joint discrimina-
for visual tracking. In: IEEE International Conference on Computer Vision. pp.
tion and reliability learning. In: IEEE Conference on Computer Vision and Pattern
3128–3136.
Recognition. pp. 489–497.
Ma, C., Yang, X., Zhang, C., Yang, M.-H., 2015c. Long-term correlation tracking. In:
Sun, C., Wang, D., Lu, H., Yang, M.-H., 2018b. Learning spatial-aware regressions for
IEEE Conference on Computer Vision and Pattern Recognition. pp. 5388–5396.
visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S., 2021. Deep learning
pp. 8962–8970.
for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst..
Supančič III, J., Ramanan, D., 2017. Tracking as online decision-making: Learning a
Mei, X., Ling, H., 2009. Robust Visual Tracking Using 𝓁1 Minimization. In: IEEE
policy from streaming videos with reinforcement learning. In: IEEE International
International Conference on Computer Vision. pp. 1436–1443.
Conference on Computer Vision. pp. 322–331.
Mei, X., Ling, H., Wu, Y., Blasch, E., Bai, L., 2011. Minimum error bounded efficient
Sutton, R.S., Barto, A.G., 1998. Introduction to Reinforcement Learning, Vol. 135.
𝓁1 tracker with occlusion detection. In: IEEE Conference on Computer Vision and
Pattern Recognition. pp. 1257–1264. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y., 2000. Policy gradient methods
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., for reinforcement learning with function approximation. In: Advances in Neural
Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv:1312. Information Processing Systems. pp. 1057–1063.
5602. Tang, M., Feng, J., 2015. Multi-kernel correlation filter for visual tracking. In: IEEE
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., International Conference on Computer Vision. pp. 3038–3046.
Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Human-level Tao, R., Gavves, E., Smeulders, A.W.M., 2016. Siamese instance search for tracking. In:
control through deep reinforcement learning. Nature 518, 529. IEEE Conference on Computer Vision and Pattern Recognition. pp. 1420–1429.
Moudgil, A., Gandhi, V., 2017. Long-term visual object tracking benchmark. arXiv: Teng, Z., Xing, J., Wang, Q., Lang, C., Feng, S., Jin, Y., et al., 2017. Robust object
1712.01358. tracking based on temporal and spatial deep networks. In: IEEE International
Mueller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B., 2018. TrackingNet: A Conference on Computer Vision. pp. 1153–1162.
large-scale dataset and benchmark for object tracking in the wild. In: European Tian, Z., Shen, C., Chen, H., He, T., 2020. FCOS: Fully convolutional one-stage object
Conference on Computer Vision. detection. In: International Conference on Computer Vision. pp. 9627–9636.
Mueller, M., Smith, N., Ghanem, B., 2016. A benchmark and simulator for UAV Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., 2005. Large margin methods for
tracking. In: European Conference on Computer Vision. pp. 445–461. structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484.
Mueller, M., Smith, N., Ghanem, B., 2017. Context-aware correlation filter tracking. In: Ungerleider, L.G., Kastner, S., 2000. Mechanisms of visual attention in the human
IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, p. 6. cortex. Annu. Rev. Neurosci. 23, 315–341.
Nam, H., Han, B., 2016. Learning Multi-domain Convolutional Neural Networks for Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H., 2017. End-to-end
Visual Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. representation learning for correlation filter based tracking. In: IEEE Conference
pp. 4293–4302. on Computer Vision and Pattern Recognition. pp. 5000–5008.
Newell, A., Yang, K., Deng, J., 2016. Stacked hourglass networks for human pose Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
estimation. In: European Conference on Computer Vision. Springer, pp. 483–499. Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information
Nguyen, H.T., Smeulders, A.W.M., 2004. Fast occluded object tracking by a robust Processing Systems, vol. 30.
appearance filter. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1099–1104. Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple
Nguyen, H.T., Smeulders, A.W.M., 2006. Robust tracking using foreground-background features. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1,
texture discrimination. Int. J. Comput. Vis. 69, 277–293. p. I.
Ning, J., Yang, J., Jiang, S., Zhang, L., Yang, M.-H., 2016. Object tracking via dual Voigtlaender, P., Luiten, J., Torr, P.H.S., Leibe, B., 2020. Siam R-CNN: Visual tracking
linear structured SVM and explicit feature map. In: IEEE Conference on Computer by re-detection. In: IEEE Conference on Computer Vision and Pattern Recognition.
Vision and Pattern Recognition. pp. 4266–4274. pp. 6577–6587.
Park, E., Berg, A.C., 2018. Meta-tracker: Fast and robust online adaptation for visual Wang, Q., Gao, J., et al., 2017. DCFNet: Discriminant correlation filters network for
object trackers. In: European Conference on Computer Vision. pp. 569–585. visual tracking. arXiv:1704.04057.
Wang, X., Hou, Z., Yu, W., Pu, L., Jin, Z., Qin, X., 2018a. Robust occlusion-aware Zhang, K., Fan, J., Liu, Q., et al., 2018a. Parallel attentive correlation tracking. IEEE
part-based visual tracking with object scale adaptation. Pattern Recognit. 81, Trans. Image Process. 28, 479–491.
456–470. Zhang, T., Ghanem, B., Liu, S., Ahuja, N., 2012a. Robust visual tracking via multi-task
Wang, X., Li, C., Yang, R., Zhang, T., Tang, J., Luo, B., 2018b. Describe and attend sparse learning. In: IEEE Conference on Computer Vision and Pattern Recognition.
to track: Learning natural language guided structural representation and visual pp. 2042–2049.
attention for object tracking. arXiv:1811.10014. Zhang, L., Gonzalezgarcia, A., De Weijer, J.V., Danelljan, M., Khan, F.S., 2019a.
Wang, D., Lu, H., Yang, M.-H., 2013. Online object tracking with sparse prototypes. Learning the model update for Siamese trackers. In: IEEE International Conference
IEEE Trans. Image Process. 22, 314–325. on Computer Vision. pp. 4010–4019.
Wang, G., Luo, C., Sun, X., Xiong, Z., Zeng, W., 2020. Tracking by instance detection: Zhang, T., Jia, K., Xu, C., Ma, Y., Ahuja, N., 2014a. Partial occlusion handling for
A meta-learning approach. In: IEEE Conference on Computer Vision and Pattern visual tracking via robust part matching. In: IEEE Conference on Computer Vision
Recognition. pp. 6287–6296. and Pattern Recognition. pp. 1258–1265.
Wang, G., Luo, C., Xiong, Z., Zeng, W., 2019a. SPM-tracker: Series-parallel matching Zhang, S., Lan, X., Yao, H., Zhou, H., Tao, D., Li, X., 2017a. A biologically inspired
for real-time visual object tracking. In: IEEE Conference on Computer Vision and appearance model for robust visual tracking. IEEE Trans. Neural Netw. Learn. Syst.
Pattern Recognition. 28, 2357–2370.
Wang, L., Ouyang, W., Wang, X., Lu, H., 2015. Visual tracking with fully con- Zhang, K., Li, X., Song, H., Liu, Q., Wei, L., 2018b. Visual tracking using
volutional networks. In: IEEE International Conference on Computer Vision. spatio-temporally nonlocally regularized correlation filter. Pattern Recognit. 83,
pp. 3119–3127. 185–195.
Wang, L., Ouyang, W., Wang, X., Lu, H., 2016. STCT: Sequentially training convolu- Zhang, T., Liu, S., Ahuja, N., Yang, M.-H., Ghanem, B., 2014b. Robust visual
tional networks for visual tracking. In: IEEE Conference on Computer Vision and tracking via consistent low-rank sparse learning. Int. J. Comput. Vis. 111,
Pattern Recognition. pp. 1373–1381. 171–190.
Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., Li, H., 2019b. Unsupervised deep Zhang, K., Liu, Q., Wu, Y., Yang, M.-H., 2016. Robust visual tracking via convolutional
tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. networks without training. IEEE Trans. Image Process. 25, 1779–1792.
1308–1317. Zhang, T., Liu, S., Xu, C., Liu, B., Yang, M.-H., 2018c. Correlation particle filter for
Wang, Q., Teng, Z., Xing, J., Gao, J., et al., 2018. Learning attentions: Residual visual tracking. IEEE Trans. Image Process. 27, 2676–2687.
attentional Siamese network for high performance online visual tracking. In: IEEE Zhang, T., Liu, S., Xu, C., Yan, S., Ghanem, B., Ahuja, N., Yang, M.-H., 2015. Structural
Conference on Computer Vision and Pattern Recognition. pp. 4854–4863. sparse tracking. In: IEEE Conference on Computer Vision and Pattern Recognition.
Wang, A., Wan, G., Cheng, Z., Li, S., 2009. An incremental extremely random forest pp. 150–158.
classifier for online learning and tracking. In: IEEE International Conference on Zhang, K., Liu, Q., Yang, J., Yang, M.-H., 2018d. Visual tracking via Boolean map
Image Processing. pp. 1449–1452. representations. Pattern Recognit. 81, 147–160.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2021a. Zhang, J., Ma, S., Sclaroff, S., 2014c. MEEM: Robust tracking via multiple ex-
Pyramid vision transformer: A versatile backbone for dense prediction without perts using entropy minimization. In: European Conference on Computer Vision.
convolutions. arXiv:2102.12122. pp. 188–203.
Wang, N., Yeung, D.-Y., 2014. Ensemble-based tracking: Aggregating crowdsourced Zhang, Z., Peng, H., 2019. Deeper and wider Siamese networks for real-time visual
structured time series data. In: International Conference on Machine Learning. pp. tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. pp.
1107–1115. 4591–4600.
Wang, N., Zhou, W., Wang, J., Li, H., 2021b. Transformer meets tracker: Exploiting Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W., 2020. Ocean: Object-aware anchor-free
temporal context for robust visual tracking. In: IEEE Conference on Computer Vision tracking. In: European Conference on Computer Vision. pp. 771–787.
and Pattern Recognition. pp. 1571–1580. Zhang, C., Platt, J.C., Viola, P.A., 2006. Multiple instance boosting for object detection.
Wright, S., Nocedal, J., et al., 1999. Numerical optimization, vol. 35. Springer Science, In: Advances in Neural Information Processing Systems. pp. 1417–1424.
p. 7. Zhang, L., Suganthan, P.N., 2017. Robust visual tracking via co-trained kernelized
Wu, Y., Lim, J., Yang, M.-H., 0000. Online object tracking: A benchmark, http: correlation filters. Pattern Recognit. 69, 82–93.
//cvlab.hanyang.ac.kr/tracker_benchmark/benchmark_v10.html. Zhang, L., Varadarajan, J., Suganthan, P.N., Ahuja, N., Moulin, P., 2017b. Robust visual
Wu, Y., Lim, J., Yang, M.H., 2013. Online object tracking: A benchmark. In: IEEE tracking using oblique random forests. In: IEEE Conference on Computer Vision and
Conference on Computer Vision and Pattern Recognition. pp. 2411–2418. Pattern Recognition. pp. 5589–5598.
Wu, Y., Lim, J., Yang, M.-H., 2015. Object tracking benchmark. IEEE Trans. Pattern Zhang, Y., Wang, L., Qi, J., Wang, D., Feng, M., Lu, H., 2018e. Structured Siamese
Anal. Mach. Intell. 37, 1834–1848. network for real-time visual tracking. In: European Conference on Computer Vision.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y., pp. 351–366.
2015. Show, attend and tell: Neural image caption generation with visual attention. Zhang, Y., Wang, L., Wang, D., Qi, J., Lu, H., 2021. Learning regression and verification
In: International Conference on Machine Learning. pp. 2048–2057. networks for robust long-term tracking. Int. J. Comput. Vis..
Xu, T., Feng, Z.-H., Wu, X.-J., Kittler, J., 2019. Joint group feature selection and Zhang, M., Wang, Q., Xing, J., Gao, J., Peng, P., Hu, W., Maybank, S., 2018f. Visual
discriminative filter learning for robust visual object tracking. In: IEEE International tracking via spatially aligned correlation filters network. In: European Conference
Conference on Computer Vision, ICCV. pp. 7949–7959. on Computer Vision. pp. 469–485.
Xu, Y., Wang, Z., Li, Z., Yuan, Y., Yu, G., 2020. SiamFC++: Towards robust and accurate Zhang, T., Xu, C., Yang, M.-H., 2017c. Multi-task correlation particle filter for robust
visual tracking with target estimation guidelines. arXiv:1911.06188. object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition,
Yan, B., Peng, H., Wu, K., Wang, D., Fu, J., Lu, H., 2021. LightTrack: Finding vol. 1, p. 3.
lightweight neural networks for object tracking via one-shot architecture search. Zhang, T., Xu, C., Yang, M.-H., 2019b. Learning multi-task correlation particle filters
arXiv:2104.14545. for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 41, 365–378.
Yang, T., Chan, A.B., 2017. Recurrent filter learning for visual tracking. In: IEEE Zhang, S., Yao, H., Sun, X., Lu, X., 2013. Sparse coding based visual tracking: Review
International Conference on Computer Vision. pp. 2010–2019. and experimental comparison. Pattern Recognit. 46, 1772–1788.
Yang, T., Chan, A.B., 2018. Learning dynamic memory networks for object tracking. Zhang, K., Zhang, L., Yang, M.-H., 2012b. Real-time compressive tracking. In: European
In: European Conference on Computer Vision. pp. 152–167. Conference on Computer Vision. pp. 864–877.
Yang, K., Song, H., Zhang, K., Fan, J., 2019a. Deeper siamese network with multi-level Zhao, L., Zhao, Q., Chen, Y., Lv, P., 2016. Combined discriminative global and
feature fusion for real-time visual tracking. Electron. Lett. 55, 742–745. generative local models for visual tracking. J. Electron. Imaging 25, 023005.
Yang, K., Song, H., Zhang, K., Liu, Q., 2019b. Hierarchical attentive siamese network Zheng, L., Tang, M., Chen, Y., Wang, J., Lu, H., 2020. Learning Feature Embeddings for
for real-time visual tracking. Neural Comput. Appl. 1–12. Discriminant Model Based Tracking. In: European Conference on Computer Vision.
Yao, Y., Wu, X., Zhang, L., Shan, S., Zuo, W., 2018. Joint representation and truncated pp. 759–775.
inference learning for correlation filter based tracking. In: European Conference on Zhong, W., Lu, H., Yang, M.-H., 2012. Robust object tracking via sparsity-based collab-
Computer Vision. pp. 552–567. orative model. In: IEEE Conference on Computer Vision and Pattern Recognition.
Yilmaz, A., Javed, O., Shah, M., 2006. Object tracking: A survey. ACM Comput. Surv. pp. 1838–1845.
38, 13. Zhou, H., Fei, M., Sadka, A., Zhang, Y., Li, X., 2014. Adaptive fusion of particle
filtering and spatio-temporal motion energy for human tracking. Pattern Recognit.
Yu, Q., Dinh, T.B., Medioni, G., 2008. Online tracking and reacquisition using co-trained
47, 3552–3567.
generative and discriminative trackers. In: European Conference on Computer
Zhu, Z., Wang, Q., Li, B., Wei, W., Yan, J., 2018a. Distractor-aware Siamese networks
Vision. Springer, pp. 678–691.
for visual object tracking. In: European Conference on Computer Vision. pp.
Yu, Z., Xiang, B., Liu, W., Latecki, L.J., 2016. Similarity fusion for visual tracking. Int.
101–117.
J. Comput. Vis. 118, 337–363.
Zhu, Z., Wu, W., Zou, W., Yan, J., 2018b. End-to-end flow correlation tracking with
Yu, Y., Xiong, Y., Huang, W., Scott, M.R., 2020. Deformable Siamese attention
spatial-temporal attention. In: IEEE Conference on Computer Vision and Pattern
networks for visual object tracking. In: IEEE Conference on Computer Vision and
Recognition. pp. 548–557.
Pattern Recognition. pp. 6727–6736.
Zhuang, B., Lu, H., Xiao, Z., Wang, D., 2014. Visual tracking via discriminative sparse
Yun, S., Choi, J., Yoo, Y., Yun, K., Choi, J.Y., 2017. Action-decision networks for visual
similarity map. IEEE Trans. Image Process. 23, 1872–1881.
tracking with deep reinforcement learning. In: IEEE Conference on Computer Vision
and Pattern Recognition. pp. 1349–1358.