TS2-Net: Token Shift and Selection Transformer For Text-Video Retrieval
{yuqi657,qjin}@ruc.edu.cn,
xiongpengfei2019@gmail.com, {lukenxu,devancao}@tencent.com
1 Introduction
Fig. 1. The text-video retrieval examples that require fine-grained video representation.
Left: the small object ‘hat’ is important for correctly retrieving the target video. Right:
the subtle movement of ‘talking’ is crucial for the correct retrieval of the target video.
Green boxes depict the positive video result, while red boxes are negative candidates
temporal patch contexts into encoded features. The shift operation was introduced in TSM [29], which shifts part of the channels along the temporal dimension. Shift Transformer [52] applies shift to the visual transformer to enhance temporal modeling. However, the transformer architecture differs from CNNs, and such a partial shift operation damages the completeness of each token representation.
Therefore, in this paper, we propose TS2-Net, a novel token shift and selec-
tion transformer network, to realize local patch feature enhancement. Specifi-
cally, we first adopt the token shift module in TS2-Net, which shifts the whole
spatial token features back-and-forth across adjacent frames, in order to capture
local movement between frames. We then design a token selection module to se-
lect top-K informative tokens to enhance the salient semantic feature modeling
capability. Our token shift module treats the features of each token as a whole,
and iteratively swaps token features at the same location with neighboring frames,
to preserve the complete local token representation and capture local temporal
semantics at the same time. The token selection module estimates the importance of each patch token feature with a selection network, which relies on the correlation between all spatial-temporal patch features and the [CLS] tokens. It then selects the tokens that contribute most to local spatial semantics. Finally, we align the cross-modal representations in a fine-grained manner, where we calculate
the similarity between text and each frame-wise video embedding and aggregate
them together. TS2-Net is optimized with video-language contrastive learning.
We conduct extensive experiments on several text-video retrieval benchmarks
to evaluate our model, including MSRVTT, VATEX, LSMDC, ActivityNet, and
DiDeMo. Our proposed TS2-Net achieves state-of-the-art performance on
most of the benchmarks. The ablation experiments demonstrate that the pro-
posed token shift and token selection modules both improve the fine-grained
text-video retrieval accuracy. The main contributions of this work are as follows:
2 Related Work
Various approaches have been proposed to deal with the text-video retrieval task, which usually consist of off-line feature extractors and a feature fusion module [50,32,21,17,11,31,14,45]. MMT [21] uses a cross-modal encoder to aggregate features extracted by different experts. MDMMT [17] further utilizes knowledge learned from multi-domain datasets. Recent works [26,4,35,19,12] attempt to
3 Method
The goal of text-video retrieval is to find the best matching videos based on
the text query. Fig.2 illustrates the overall structure of the proposed TS2-Net
model for the text-video retrieval task, which consists of three key components:
Fig. 2. Overview of the proposed TS2-Net model for text-video retrieval, which con-
sists of three key components: the text encoder, the video encoder, and the text-video
matching. The video encoder is composed of the Token Shift Transformer and Token
Selection Transformer. (‘Repre’ is short for ‘Representation’)
the text encoder, the video encoder, and the text-video matching. The text en-
coder encodes the sequence of query words into a query representation q. In this
paper, we use the GPT [39] model as the text encoder. By adding a special token [EOS] at the end of the query word sequence, we take the GPT encoding of [EOS] as the query representation q. The video encoder encodes the sequence of video frames into a sequence of frame-wise video representations v = {f1, f2, . . . , ft}. Based on the query and video representations, q and v, the
text-video matching computes the cross-modal similarity between the query and
video candidate. In the following sections, we first elaborate on the core ingredients of our video encoder, namely the token shift transformer (Sec. 3.1) and the token selection transformer (Sec. 3.2), and then present our text-video matching strategy in detail (Sec. 3.3).
Fig. 3. Illustration of different types of Shift operation and our proposed Token Tempo-
ral Shift. ‘T, P, C’ refer to video temporal dimension, video token, and feature channel
respectively. Each vertical cube group represents a spatial-temporal video token. Cubes
with dashed lines represent truncated tensors, and white cubes represent tensor padding.
In Shift-Transformer [52], tokens are shifted along the channel dimension, while our
proposed Token Shift Module does not compromise the integrity of a video token
Shift-Transformer [52] has also explored several shift variants on the visual
transformer architecture. Fig.3 visualizes the difference between these shift vari-
ants and our proposed token shift. A naive channel temporal shift swaps part of the channels of a frame tensor along the temporal dimension, as shown in Fig. 3(a). Shift-Transformer [52] also presents [VIS] channel temporal shift and [CLS] channel temporal shift, as shown in Fig. 3(b)(c). These variants fix the tensor in the token dimension and shift part of the channels of the chosen tokens along the temporal dimension. Different from these works, our token shift transformer emphasizes the token dimension, where we shift all channels of a token back-and-forth across adjacent frames,
as shown in Fig. 3(d). We believe our token shift is better suited to the ViT architecture because, unlike in CNNs, each token in ViT is independent and carries unique spatial information tied to its location. Thus shifting part of the channels destroys the integrity of the information contained in a token. In contrast, shifting a whole token with all of its channels preserves the complete information contained in the token while still enabling cross-frame interaction.
However, if we shift most of the tokens in every ViT layer, it damages the spa-
tial modeling ability, and the information contained in these tokens is no longer
accessible in the current frame. We therefore use a residual connection between the original features and the token-shifted features, as illustrated in Fig. 2. In addition, we assume that shallow layers are more important for modeling spatial features, so shifting in shallow layers could harm spatial modeling. We thus apply the token shift operation only in the deeper layers in our implementation.
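To make the token shift concrete, below is a minimal PyTorch sketch of shifting whole tokens back-and-forth across adjacent frames with a residual connection. The module name, the shift ratio, keeping the [CLS] token (index 0) unshifted, and the exact residual placement are our own illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class TokenShift(nn.Module):
    """Shift whole spatial tokens back-and-forth across adjacent frames.

    Unlike channel-wise temporal shift (TSM / Shift-Transformer), each shifted
    token keeps all of its channels, so its representation stays complete; it
    simply comes from a neighboring frame.
    """
    def __init__(self, shift_ratio: float = 0.25):  # shift_ratio is an assumption
        super().__init__()
        self.shift_ratio = shift_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) = batch, frames, tokens per frame ([CLS] + patches), channels
        B, T, N, C = x.shape
        n_shift = max(1, int((N - 1) * self.shift_ratio))
        shifted = x.clone()
        # one group of patch tokens takes its features from the previous frame ...
        shifted[:, 1:, 1:1 + n_shift] = x[:, :-1, 1:1 + n_shift]
        # ... and another group takes its features from the next frame
        shifted[:, :-1, 1 + n_shift:1 + 2 * n_shift] = x[:, 1:, 1 + n_shift:1 + 2 * n_shift]
        # residual connection between original and token-shifted features (cf. Fig. 2)
        return x + shifted

In TS2-Net such a shift would be interleaved only with the deeper ViT blocks, as discussed above.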
Fig. 4. Illustration of Token Selection Module. Top-K informative tokens are selected
per frame from original spatial-temporal tokens for following feature aggregation
We finally feed all the concatenated token features to another MLP followed by
a Softmax layer to predict the importance scores, which can be formulated as:
\[ \boldsymbol{S} = \operatorname{Softmax}\big(\operatorname{MLP}(\hat{p})\big) \in \mathbb{R}^{(N+1)}. \tag{1} \]
We select the indices of the K most informative tokens based on S, denoted as M ∈ {0, 1}^{(N+1)×K}, where each column of M is a one-hot (N+1)-dimensional indicator. We extract the top-K most informative tokens by:
\[ \hat{\mathbf{I}} = \mathbf{M}^{T} \mathbf{I}, \tag{2} \]
After top-K token selection on every frame, we feed the selected tokens from all frames into a joint spatial-temporal transformer to learn a global spatial-temporal video representation. We also pick the most informative token from each frame as the frame-wise video encoding.
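Below is a minimal sketch of the selection step in Eqs. (1)-(2): a small scoring MLP produces the score vector S, and an indicator matrix gathers the top-K tokens per frame. The MLP width and the way the frame [CLS] feature is concatenated to each token are our assumptions; the hard top-K here reflects inference-time behavior, and the differentiable relaxation used for training is discussed next.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    def __init__(self, dim: int = 512, k: int = 4):
        super().__init__()
        self.k = k
        # scores each token from its feature concatenated with the frame [CLS] feature
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B*T, N+1, C) per-frame tokens, index 0 is [CLS]
        cls = tokens[:, :1].expand_as(tokens)                   # broadcast [CLS] to every token
        s = self.scorer(torch.cat([tokens, cls], dim=-1)).squeeze(-1)
        s = F.softmax(s, dim=-1)                                # Eq. (1): importance scores S
        topk = s.topk(self.k, dim=-1).indices                   # indices of the K best tokens
        m = F.one_hot(topk, tokens.size(1)).float()             # M^T: (B*T, K, N+1)
        return torch.bmm(m, tokens)                             # Eq. (2): selected tokens (B*T, K, C)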
Differentiable TopK. Up to this point, both the top-K operation and the one-hot operation are non-differentiable. To make the token selection module differentiable, we
\[ F(\boldsymbol{S}) = \max_{\mathbf{M} \in \mathcal{C}} \langle \mathbf{M}, \boldsymbol{S} \rangle, \qquad \mathbf{M}^{*}(\boldsymbol{S}) = \underset{\mathbf{M} \in \mathcal{C}}{\arg\max}\, \langle \mathbf{M}, \boldsymbol{S} \rangle, \tag{3} \]
where F(S) is the optimal value of the top-K selection problem and M*(S) is the corresponding optimal indicator matrix. Based on Eq. 3, we can select the top-K informative tokens. We compute the forward and backward passes following [1,13].
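To make Eq. (3) differentiable, the perturbed-optimizer approach of [1,6,13] averages hard top-K indicator matrices over Gaussian perturbations of the scores in the forward pass and uses the corresponding Jacobian estimator in the backward pass. The sketch below follows that recipe; the number of noise samples and the noise scale are illustrative hyper-parameters, not values from the paper.

import torch
import torch.nn.functional as F

class PerturbedTopK(torch.autograd.Function):
    """Differentiable top-K selection via perturbed maximization (Berthet et al. [6])."""

    @staticmethod
    def forward(ctx, scores, k, n_samples, sigma):
        # scores: (B, N) importance scores S for the tokens of each frame
        noise = torch.randn(n_samples, *scores.shape, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise                 # (S, B, N)
        topk_idx = perturbed.topk(k, dim=-1).indices                    # (S, B, K)
        indicators = F.one_hot(topk_idx, scores.shape[-1]).float()      # (S, B, K, N)
        ctx.save_for_backward(noise, indicators)
        ctx.sigma, ctx.n_samples = sigma, n_samples
        return indicators.mean(dim=0)                                   # soft M^T: (B, K, N)

    @staticmethod
    def backward(ctx, grad_output):
        noise, indicators = ctx.saved_tensors
        # Jacobian estimator E[M*(S + sigma*Z) Z^T] / sigma from [6]
        jac = torch.einsum('sbkn,sbm->bknm', indicators, noise) / (ctx.n_samples * ctx.sigma)
        grad_scores = torch.einsum('bkn,bknm->bm', grad_output, jac)
        return grad_scores, None, None, None

# usage: the soft indicator replaces the hard M in Eq. (2)
# m_soft = PerturbedTopK.apply(scores, 4, 500, 0.05)        # (B, K, N)
# selected = torch.einsum('bkn,bnc->bkc', m_soft, tokens)   # (B, K, C)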
where $\alpha_i = \frac{\exp(\lambda s_i)}{\sum_{j=1}^{n} \exp(\lambda s_j)}$ and $\lambda$ is a temperature parameter. We set $\lambda$ to 4 empirically in our experiments.
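The definition of α_i suggests that the overall text-video similarity is a softmax-weighted aggregation of frame-level similarities. The sketch below assumes that form for Eq. (5), which is not reproduced in this excerpt; the cosine normalization is also our assumption.

import torch
import torch.nn.functional as F

def text_video_sim(q: torch.Tensor, v: torch.Tensor, lam: float = 4.0) -> torch.Tensor:
    # q: (Bq, C) query embeddings; v: (Bv, T, C) frame-wise video embeddings
    q = F.normalize(q, dim=-1)
    v = F.normalize(v, dim=-1)
    s = torch.einsum('qc,vtc->qvt', q, v)       # frame-level similarities s_i
    alpha = F.softmax(lam * s, dim=-1)          # alpha_i weights over the frames
    return (alpha * s).sum(dim=-1)              # (Bq, Bv) aggregated similarity matrix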
Symmetric cross-entropy loss is adopted as our training objective function.
For each training step with B text-video pairs, we calculate symmetric cross-
entropy loss as follows:
\[ \mathcal{L}_{t}^{t2v} = -\frac{1}{B} \sum_{i}^{B} \log \frac{\exp\left(\tau \cdot \operatorname{sim}(q_{i}, v_{i})\right)}{\sum_{j=1}^{B} \exp\left(\tau \cdot \operatorname{sim}(q_{i}, v_{j})\right)}, \tag{6} \]
\[ \mathcal{L}_{t}^{v2t} = -\frac{1}{B} \sum_{i}^{B} \log \frac{\exp\left(\tau \cdot \operatorname{sim}(q_{i}, v_{i})\right)}{\sum_{j=1}^{B} \exp\left(\tau \cdot \operatorname{sim}(q_{j}, v_{i})\right)}, \tag{7} \]
where τ is a trainable scaling parameter and sim(q, v) is calculated using Eq. 5. During inference, we calculate the matching score between each text and video based on Eq. 5 and return the highest-ranked videos.
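A compact sketch of the symmetric objective in Eqs. (6)-(7), using the similarity matrix from the aggregation above; averaging the two terms with equal weight is our assumption, since the combined objective is not shown in this excerpt.

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(sim: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    # sim: (B, B) text-video similarity matrix of a batch of matched pairs,
    # where sim[i, j] = sim(q_i, v_j); tau: trainable scaling parameter.
    logits = tau * sim
    labels = torch.arange(sim.size(0), device=sim.device)
    loss_t2v = F.cross_entropy(logits, labels)       # Eq. (6): each text against all videos
    loss_v2t = F.cross_entropy(logits.t(), labels)   # Eq. (7): each video against all texts
    return 0.5 * (loss_t2v + loss_v2t)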
4 Experiment
In this section, we carry out text-video retrieval evaluations on multiple bench-
mark datasets to validate our proposed model TS2-Net. We first ablate the core
ingredients of our video encoder, the token shift transformer and the token se-
lection transformer, on the dominant MSR-VTT dataset. We then compare our
model with other state-of-the-art models on multiple benchmark datasets quan-
titatively and qualitatively.
Table 1. Performance comparison with different parameter settings of the Token Shift
Transformer on MSR-VTT-1k-A test split
Table 2. Performance comparison between other shift operation variants and our
proposed token shift module on MSR-VTT-1k-A test split
Implementation Details. The number of layers in the GPT text encoder, token shift transformer, and token selection transformer is 12, 12, and 4, respectively. The dimension of the text embedding and frame embedding is 512. We initialize the transformer layers in the GPT encoder, token shift transformer, and token selection transformer with pre-trained weights from CLIP (ViT-B/32) [38], using parameters of matching dimensions, while other modules are initialized randomly. We select the 4 most informative tokens per frame on MSR-VTT, VATEX, ActivityNet-Caption, and DiDeMo, and 1 on LSMDC. We set the maximum query text length to 32 and the maximum number of video frames to 12 for MSR-VTT, VATEX, and LSMDC. For ActivityNet-Caption and DiDeMo, we set both the maximum query text length and the maximum number of video frames to 64. We train our model with the Adam [24] optimizer and adopt a warmup [23] schedule, with a batch size of 128. The learning rate of the GPT encoder and token shift transformer is 1e-7, and the learning rate of the token selection transformer is 1e-4.
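For illustration, a minimal sketch of how the two learning rates above could be wired up with Adam parameter groups; the module name patterns ('text_encoder', 'shift_transformer') and the omission of the warmup schedule are our own assumptions, not the authors' released training code.

import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # CLIP-initialized parts (GPT text encoder, token shift transformer) get a tiny
    # learning rate, while the randomly initialized token selection transformer and
    # other new modules get a larger one. Warmup is applied by a separate scheduler
    # (not shown here).
    clip_params, new_params = [], []
    for name, param in model.named_parameters():
        if 'text_encoder' in name or 'shift_transformer' in name:  # assumed module names
            clip_params.append(param)
        else:
            new_params.append(param)
    return torch.optim.Adam([
        {'params': clip_params, 'lr': 1e-7},
        {'params': new_params, 'lr': 1e-4},
    ])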
In this section, we evaluate the proposed token shift transformer and token
selection transformer under different settings to validate their effectiveness. We
conduct ablation experiments on the 1k-A test split of MSR-VTT [48]. Our baseline is a degraded TS2-Net in which the token shift and token selection modules are removed.
Fig. 5. Text-video retrieval results of different network architectures. Left: with the token shift transformer, our model is able to distinguish ‘shake hands’, while the baseline model retrieves an incorrect video. Right: with the token selection transformer, our model retrieves the correct video, although ‘bag’ appears in only a small part of the video frames.
Green boxes: correct target video; red boxes: incorrect target video
Selecting fewer tokens per frame tends to achieve better performance than selecting more. For example, the R@1 performance decreases from 47.0 to 45.8 as the number of selected tokens increases from 2 to 50. We consider that a few informative tokens are sufficient to preserve the salient spatial information, while adding more tokens may introduce redundancy. Although random selection also improves the performance slightly, it cannot match the proposed learnable token selection module. In Fig. 5(b), we show a retrieval case comparing the baseline model and the model with the token selection transformer. With the token selection transformer, the model is able to capture the small object ‘bag’ in the video frames.
Table 5. Text-to-video retrieval results on VATEX, LSMDC, ActivityNet, and DiDeMo

VATEX
Method            R@1   R@5   R@10  MdR  MeanR
Dual Enc.[15]     31.1  67.5  78.9  3.0  -
HGR[11]           35.1  73.5  83.5  2.0  -
CLIP[38]          39.7  72.3  82.2  2.0  12.8
CLIP4Clip[35]     55.9  89.2  95.0  1.0  3.9
QB-Norm*[7]       58.8  88.3  93.8  1.0  -
CLIP2Video[19]    57.3  90.0  95.5  1.0  3.6
TS2-Net           59.1  90.0  95.2  1.0  3.5

LSMDC
Method            R@1   R@5   R@10  MdR   MeanR
JSFusion[50]      9.1   21.2  34.1  36.0  -
CE[32]            11.2  26.9  34.9  25.3  -
Frozen[4]         15.0  30.8  39.8  20.0  -
CLIP4Clip[35]     22.6  41.0  49.1  11.0  61.0
QB-Norm*[7]       22.4  40.1  49.5  11.0  -
CAMoE[12]         22.5  42.6  50.9  -     56.5
TS2-Net           23.4  42.3  50.9  9.0   56.9

ActivityNet
Method               R@1   R@5   R@10  MdR  MeanR
CE[32]               20.5  47.7  63.9  6.0  23.1
ClipBERT[26]         21.3  49.0  63.5  6.0  -
MMT-Pretrained[21]   28.7  61.4  -     3.3  16.0
CLIP4Clip[35]        40.5  73.4  -     2.0  7.5
TS2-Net              41.0  73.6  84.5  2.0  8.4

DiDeMo
Method            R@1   R@5   R@10  MdR  MeanR
ClipBERT[26]      20.4  48.0  60.8  6.0  -
TT-CE[14]         21.1  47.3  61.1  6.3  -
Frozen[4]         31.0  59.8  72.4  3.0  -
CLIP4Clip[35]     42.5  70.2  80.6  2.0  17.5
TS2-Net           41.8  71.6  82.0  2.0  14.8
test set. Our model outperforms previous methods across different evaluation
metrics. With token shift transformer and token selection transformer, our model
is able to capture subtle motion and salient objects, and thus our final video
representation contains rich semantics. Compared with video-to-text retrieval,
the gain on text-to-video retrieval is more significant. We attribute this to the proposed token shift and token selection modules enhancing the video encoder, while a relatively simple text encoder is adopted.
Other Benchmarks. Tab.5 presents text-to-video retrieval results on VATEX,
LSMDC, ActivityNet-Caption and DiDeMo. Results on these datasets demon-
strate the generalization and robustness of our proposed model. We can observe
that our model achieves consistent improvements across different datasets, which
demonstrates that it is beneficial to encode spatial and temporal features simul-
taneously by our token shift and token selection. Note that our performance
surpasses QB-Norm[7] on LSMDC and VATEX even without inverted softmax,
as shown in Tab. 5. A more detailed analysis is provided in the supplementary material.
We visualize some retrieval examples from the MSR-VTT testing set for text-to-
video retrieval in Fig.6. In the top left example, our model is able to distinguish
‘hand rubbing’ (in the middle picture) during a guitar-playing scene. The bot-
tom right example shows our model can distinguish ‘computer battery’ from
‘computer’. In the bottom left example, our model retrieves the correct video
which contains all actions and objects expressed in the text query, especially
the small object ‘microphone’ and the subtle movement ‘talking’. In the bottom right example, our model retrieves the correct result although ‘rotating’ is a periodic movement that is hard to spot.
We also select a subset from the MSR-VTT-1kA test set. Queries in this subset are chosen based on the visual appearance of their corresponding videos: the objects mentioned in the query appear in only a small part of the video, and the movements mentioned in the query are slight. Examples include ‘little pet shop cat getting a bath and washed with little brush’, ‘man talks in front of a green bicycle’, ‘dog is drinking milk with baby nibble bottle’, and ‘a golf player is trying to hit the ball into the pit’. Since such cases account for only a small proportion of the data, the subset contains 103 queries. During inference, we calculate the similarity between the queries in this subset and all videos in the whole test set. We compare our model with another strong baseline on this subset: our model achieves 79.6 on the R@1 metric, while CLIP4Clip [35] only achieves 39.8. This significant margin verifies the effectiveness of TS2-Net in handling subtle local movements and small local entities.
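Concretely, this subset evaluation amounts to ranking the entire test gallery for every subset query and measuring R@1; a small sketch, assuming the query-video similarity matrix has already been computed with the model:

import torch

def recall_at_1(sim: torch.Tensor, gt_index: torch.Tensor) -> float:
    # sim: (Q, V) similarities between the subset queries and ALL test videos;
    # gt_index: (Q,) index of each query's ground-truth video in the gallery.
    top1 = sim.argmax(dim=-1)
    return (top1 == gt_index).float().mean().item() * 100.0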
5 Conclusion
In this work, we propose the Token Shift and Selection Network (TS2-Net), a novel
transformer architecture with token shift and selection modules, which aims to
further improve the video encoder for better video representation. A token shift
transformer is used to capture subtle movements, followed by a token selection
transformer to enhance salient object modeling ability. Superior experimental
results show that our proposed TS2-Net outperforms state-of-the-art methods on
five text-video retrieval benchmarks, including MSR-VTT, VATEX, LSMDC,
ActivityNet-Caption and DiDeMo.
References
1. Abernethy, J., Lee, C., Tewari, A.: Perturbation techniques in online learning and
optimization. Perturbations, Optimization, and Statistics p. 223 (2016)
2. Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.:
Localizing moments in video with natural language. In: Proceedings of the IEEE
international conference on computer vision. pp. 5803–5812 (2017)
3. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A
video vision transformer. In: Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision. pp. 6836–6846 (2021)
4. Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video
and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 1728–1738 (2021)
5. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095 (2021)
6. Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J.P., Bach, F.: Learning
with differentiable perturbed optimizers. Advances in neural information processing
systems pp. 9508–9519 (2020)
7. Bogolin, S.V., Croitoru, I., Jin, H., Liu, Y., Albanie, S.: Cross modal retrieval with
querybank normalisation. arXiv preprint arXiv:2112.12777 (2021)
8. Bulat, A., Perez Rua, J.M., Sudhakaran, S., Martinez, B., Tzimiropoulos, G.:
Space-time mixing attention for video transformer. Advances in Neural Informa-
tion Processing Systems (2021)
9. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the
kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 6299–6308 (2017)
10. Chen, D., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation.
In: Proceedings of the 49th annual meeting of the association for computational
linguistics: human language technologies. pp. 190–200 (2011)
11. Chen, S., Zhao, Y., Jin, Q., Wu, Q.: Fine-grained video-text retrieval with hierar-
chical graph reasoning. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 10638–10647 (2020)
12. Cheng, X., Lin, H., Wu, X., Yang, F., Shen, D.: Improving video-text re-
trieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint
arXiv:2109.04290 (2021)
13. Cordonnier, J.B., Mahendran, A., Dosovitskiy, A., Weissenborn, D., Uszkoreit, J.,
Unterthiner, T.: Differentiable patch selection for image recognition. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 2351–2360 (2021)
14. Croitoru, I., Bogolin, S.V., Leordeanu, M., Jin, H., Zisserman, A., Albanie, S.,
Liu, Y.: Teachtext: Crossmodal generalized distillation for text-video retrieval. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. pp.
11583–11593 (2021)
15. Dong, J., Li, X., Xu, C., Ji, S., He, Y., Yang, G., Wang, X.: Dual encoding for
zero-example video retrieval. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 9346–9355 (2019)
16. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. ICLR
(2021)
17. Dzabraev, M., Kalashnikov, M., Komkov, S., Petiushko, A.: Mdmmt: Multidomain
multimodal transformer for video retrieval. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition. pp. 3354–3363 (2021)
18. Caba Heilbron, F., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-
scale video benchmark for human activity understanding. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. pp. 961–970 (2015)
19. Fang, H., Xiong, P., Xu, L., Chen, Y.: Clip2video: Mastering video-text retrieval
via image clip. arXiv preprint arXiv:2106.11097 (2021)
20. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recog-
nition. In: Proceedings of the IEEE/CVF international conference on computer
vision. pp. 6202–6211 (2019)
21. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video
retrieval. In: European Conference on Computer Vision. pp. 214–229 (2020)
22. Gao, Z., Liu, J., Chen, S., Chang, D., Zhang, H., Yuan, J.: Clip2tv: An empir-
ical study on transformer-based methods for video-text retrieval. arXiv preprint
arXiv:2111.05610 (2021)
23. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tul-
loch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1
hour. arXiv preprint arXiv:1706.02677 (2017)
24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, San Diego (2015)
25. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning
events in videos. In: Proceedings of the IEEE international conference on computer
vision. pp. 706–715 (2017)
26. Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T.L., Bansal, M., Liu, J.: Less is more:
Clipbert for video-and-language learning via sparse sampling. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
7331–7341 (2021)
27. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-vl: A universal encoder
for vision and language by cross-modal pre-training. In: Proceedings of the AAAI
Conference on Artificial Intelligence. pp. 11336–11344. No. 07 (2020)
28. Li, L., Chen, Y.C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: Hero: Hierarchical
encoder for video+language omni-representation pre-training. arXiv preprint
arXiv:2005.00200 (2020)
29. Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video un-
derstanding. In: Proceedings of the IEEE International Conference on Computer
Vision (2019)
30. Liu, F., Ye, R.: A strong and robust baseline for text-image matching. In: Proceed-
ings of the 57th Annual Meeting of the Association for Computational Linguistics:
Student Research Workshop. pp. 169–176 (2019)
31. Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., Wang, Z.: Hit: Hierarchical trans-
former with momentum contrast for video-text retrieval. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 11915–11925 (2021)
32. Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: Video retrieval
using representations from collaborative experts. arXiv preprint arXiv:1907.13487
(2019)
33. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin trans-
former. arXiv preprint arXiv:2106.13230 (2021)
34. Luo, H., Ji, L., Shi, B., Huang, H., Duan, N., Li, T., Li, J., Bharti, T., Zhou,
M.: Univl: A unified video and language pre-training model for multimodal under-
standing and generation. arXiv preprint arXiv:2002.06353 (2020)
35. Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., Li, T.: Clip4clip: An empir-
ical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860
(2021)
36. Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.:
Howto100m: Learning a text-video embedding by watching hundred million nar-
rated video clips. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 2630–2640 (2019)
37. Patrick, M., Huang, P.Y., Asano, Y., Metze, F., Hauptmann, A., Henriques, J.,
Vedaldi, A.: Support-set bottlenecks for video-text representation learning. arXiv
preprint arXiv:2010.02824 (2020)
38. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning.
pp. 8748–8763 (2021)
39. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language
models are unsupervised multitask learners. OpenAI blog p. 9 (2019)
40. Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient
vision transformers with dynamic token sparsification. Advances in neural infor-
mation processing systems (2021)
41. Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H.,
Courville, A., Schiele, B.: Movie description. International Journal of Computer
Vision pp. 94–120 (2017)
42. Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word
vectors, orthogonal transformations and the inverted softmax. In: Proceedings of
the 5th International Conference on Learning Representations (2017)
43. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of
generic visual-linguistic representations. ICLR (2020)
44. Wang, J., Yang, X., Li, H., Wu, Z., Jiang, Y.G.: Efficient video transformers with
spatial-temporal token selection. arXiv preprint arXiv:2111.11591 (2021)
45. Wang, X., Zhu, L., Yang, Y.: T2vlad: global-local sequence alignment for text-
video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 5079–5088 (2021)
46. Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.F., Wang, W.Y.: Vatex: A large-scale,
high-quality multilingual dataset for video-and-language research. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 4581–4591
(2019)
47. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature
learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the
European conference on computer vision (ECCV). pp. 305–321 (2018)
48. Xu, J., Mei, T., Yao, T., Rui, Y.: Msr-vtt: A large video description dataset for
bridging video and language. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 5288–5296 (2016)
49. Yang, J., Bisk, Y., Gao, J.: Taco: Token-aware cascade contrastive learning for
video-text alignment. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. pp. 11562–11572 (2021)
50. Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question an-
swering and retrieval. In: Proceedings of the European Conference on Computer
Vision (ECCV). pp. 471–487 (2018)
51. Zhang, B., Hu, H., Sha, F.: Cross-modal and hierarchical modeling of video and
text. In: Proceedings of the European Conference on Computer Vision (ECCV).
pp. 374–390 (2018)
52. Zhang, H., Hao, Y., Ngo, C.W.: Token shift transformer for video classification.
In: Proceedings of the 29th ACM International Conference on Multimedia. pp.
917–925 (2021)
53. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language
pre-training for image captioning and vqa. In: Proceedings of the AAAI Conference
on Artificial Intelligence (2020)
Table 6. Retrieval results with inverted softmax applied at inference on MSRVTT-1kA and DiDeMo

MSRVTT-1kA
Method            R@1   R@5   R@10  MdR  MeanR
QB-Norm[7]        47.2  73.0  83.0  2.0  -
CAMoE[12]         47.3  74.2  84.5  2.0  11.9
TS2-Net           51.1  76.9  85.6  1.0  11.7
CLIP2TV[22]       52.9  78.5  86.5  1.0  12.8
TS2-Net(ViT16)    54.0  79.3  87.4  1.0  11.7

DiDeMo
Method            R@1   R@5   R@10  MdR  MeanR
QB-Norm[7]        43.5  71.4  80.9  2.0  -
CAMoE[12]         43.8  71.4  79.9  2.0  16.3
TS2-Net           47.4  74.1  82.4  2.0  12.9
A Inverted Softmax.
The hubness phenomenon [30] refers to some data points occurring among the k nearest neighbors of many other data points. A dual softmax loss (DSL), which adopts an inverted softmax [42], was introduced in CAMoE [12]. QB-Norm [7] proposes querybank normalization with a dynamic inverted softmax (DIS) to deal with the hubness problem. CLIP2TV [22] also reports its results with an inverted softmax. We compare against their results using a basic inverted softmax during inference in Tab. 6. Our results again surpass all other methods by a significant margin.
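For reference, a minimal sketch of the basic inverted softmax applied at inference (in the spirit of the dual softmax in [12,42]): the similarity matrix is re-weighted by a softmax over the query axis, which down-weights ‘hub’ videos that score highly for many queries. The temperature value is an illustrative assumption.

import torch

def inverted_softmax(sim: torch.Tensor, temp: float = 100.0) -> torch.Tensor:
    # sim: (Q, V) text-to-video similarity matrix over the whole test set.
    # Each video column is re-weighted by a softmax over all queries, so videos
    # that are close to many queries (hubs) are suppressed before ranking.
    prior = torch.softmax(temp * sim, dim=0)
    return prior * sim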
words. For example, our model correctly retrieves the video with ‘metal bowl’
rather than ‘glass bowl’ (and ‘overweight people’ rather than normal people) in
the bottom examples.
We also show some failure cases in Fig. 9, where our model fails to rank the ground-truth video at the top. However, one could argue that in these failure cases our model may actually retrieve the more relevant video. For example, in the left case, the video retrieved by our model (in the red box) seems more relevant to the query text, since both ‘cup’ and ‘talking’ can be seen in our result, while ‘talking’ cannot be seen in the ground truth.
Based on further analysis, we find that there are also many vague and general annotations in the datasets, such as the example shown in the right case of Fig. 9. Such query annotations account for 1-2% of the dataset. We believe our model could gain on all metrics if such cases were re-annotated with more discriminative descriptions.
Fig. 8. Visualization of more text-video retrieval examples. We rank the retrieval re-
sults based on their similarity scores. Green boxes: the correctly retrieved groundtruth video; red boxes: incorrectly retrieved videos
Fig. 9. Visualization of some failure text-video retrieval examples. We rank the re-
trieval results based on their similarity scores. Green boxes: the correctly retrieved groundtruth video; red boxes: the incorrectly retrieved video by our model