TY - JOUR
AU - Reitter, David
AB - To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The “Bounding Boxes in Text Transformer” (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark, achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
TI - Fusion of Detected Objects in Text for Visual Question Answering
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1219
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/fusion-of-detected-objects-in-text-for-visual-question-answering-0EAfVeYSqA
DP - DeepDyve
ER -