TY - JOUR
AU - Lu, Wei
TI - Interventional Video Grounding with Dual Contrastive Learning
AB - Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of causal inference, i.e., interventional video grounding (IVG), which leverages backdoor adjustment to deconfound the selection bias based on a structured causal model (SCM) and the do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder, as it cannot be directly sampled from the dataset (e.g., spurious correlations between the object "people" and "vacuum" and the activity "people are holding a vacuum" in the Charades-STA dataset). 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between ...
JF - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DO - 10.1109/cvpr46437.2021.00279
DA - 2021-06-01
UR - https://www.deepdyve.com/lp/unpaywall/interventional-video-grounding-with-dual-contrastive-learning-UIOA3h8Gb0
DP - DeepDyve
ER -