TY - JOUR
AU - Soldan, Mattia
AU - Pardo, Alejandro
AU - Alcázar, Juan León
AU - Caba Heilbron, Fabian
AU - Zhao, Chen
AU - Giancola, Silvio
AU - Ghanem, Bernard
AB - The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.
TI - MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
JF - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DO - 10.1109/cvpr52688.2022.00497
DA - 2022-06-01
UR - https://www.deepdyve.com/lp/unpaywall/mad-a-scalable-dataset-for-language-grounding-in-videos-from-movie-Ob3A7RWcb0
DP - DeepDyve
ER -