TY - JOUR
AU - Tan, Hao
AU - Bansal, Mohit
AD - UNC Chapel Hill
AB - Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of con…
TI - LXMERT: Learning Cross-Modality Encoder Representations from Transformers
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1514
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/lxmert-learning-cross-modality-encoder-representations-from-0Kk2c2rAl4
DP - DeepDyve
ER -