TY - JOUR
AU - Clark, Kevin
AU - Khandelwal, Urvashi
AU - Levy, Omer
AU - Manning, Christopher D.
TI - What Does BERT Look at? An Analysis of BERT's Attention
AB - Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT.
JF - Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
DO - 10.18653/v1/w19-4828
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/what-does-bert-look-at-an-analysis-of-bert-s-attention-9H2KQAubmv
DP - DeepDyve
ER - 