TY - JOUR
AU - Lu, Wei
TI - Interventional Video Grounding with Dual Contrastive Learning
AB - Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of causal inference, i.e., interventional video grounding (IVG), which leverages backdoor adjustment to deconfound the selection bias based on a structured causal model (SCM) and the do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder, as it cannot be directly sampled from the dataset (e.g., spurious correlations between the object "people" and "vacuum" and the activity "people are holding a vacuum" in the Charades-STA dataset). 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between ...
JF - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DO - 10.1109/cvpr46437.2021.00279
DA - 2021-06-01
UR - https://www.deepdyve.com/lp/unpaywall/interventional-video-grounding-with-dual-contrastive-learning-UIOA3h8Gb0
DP - DeepDyve
ER -