TY - JOUR
AU - 
AB - Conventionally, the spatiotemporal modeling network and its complexity are the two most intensively studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is built on 2D CNN backbones, and model complexity is kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs.
TI - MVFNet: Multi-View Fusion Network for Efficient Video Recognition
JF - Proceedings of the AAAI Conference on Artificial Intelligence
DO - 10.1609/aaai.v35i4.16401
DA - 2021-05-18
UR - https://www.deepdyve.com/lp/unpaywall/mvfnet-multi-view-fusion-network-for-efficient-video-recognition-LCihTaGLMF
DP - DeepDyve
ER - 
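
The abstract above describes a plug-and-play multi-view fusion (MVF) module that uses separable convolutions to model the Height-Width, Height-Time, and Width-Time views of a clip on top of a 2D CNN backbone. The PyTorch sketch below is only a minimal illustration of that idea under stated assumptions, not the paper's implementation: the class name MultiViewFusionSketch, the depthwise 3x1x1 / 1x3x1 / 1x1x3 kernels, the residual fusion, and the (N*T, C, H, W) reshape convention are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn


class MultiViewFusionSketch(nn.Module):
    """Illustrative multi-view fusion block (a sketch, not the authors' code).

    Activations of a 2D backbone, reshaped to (N, C, T, H, W), are passed
    through depthwise 1D convolutions along the T, H, and W axes so that the
    Height-Time and Width-Time views are modeled in addition to the usual
    Height-Width view. The responses are summed and added back residually,
    keeping the block plug-and-play and cheap (separable convolutions).
    """

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise (groups=channels) 1D convolutions along T, H, W.
        self.conv_t = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                padding=(1, 0, 0), groups=channels, bias=False)
        self.conv_h = nn.Conv3d(channels, channels, kernel_size=(1, 3, 1),
                                padding=(0, 1, 0), groups=channels, bias=False)
        self.conv_w = nn.Conv3d(channels, channels, kernel_size=(1, 1, 3),
                                padding=(0, 0, 1), groups=channels, bias=False)
        self.bn = nn.BatchNorm3d(channels)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x comes from a 2D backbone with shape (N * T, C, H, W).
        nt, c, h, w = x.shape
        n = nt // num_frames
        v = x.view(n, num_frames, c, h, w).permute(0, 2, 1, 3, 4)  # (N, C, T, H, W)
        # Fuse the three single-axis responses, then restore the 2D layout.
        v = self.bn(self.conv_t(v) + self.conv_h(v) + self.conv_w(v))
        v = v.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)
        return x + v  # residual connection keeps the module plug-and-play


# Hypothetical usage: insert after a stage of a 2D CNN processing 8-frame clips.
if __name__ == "__main__":
    block = MultiViewFusionSketch(channels=256)
    feats = torch.randn(2 * 8, 256, 14, 14)  # (batch * frames, C, H, W)
    out = block(feats, num_frames=8)
    print(out.shape)  # torch.Size([16, 256, 14, 14])
```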