TY - JOUR
TI - Universal Adversarial Triggers for Attacking and Analyzing NLP
AU - Wallace, Eric
AU - Feng, Shi
AU - Kandpal, Nikhil
AU - Gardner, Matt
AU - Singh, Sameer
N1 - Affiliations: Eric Wallace (Allen Institute for Artificial Intelligence), Shi Feng (University of Maryland), Nikhil Kandpal (Independent Researcher), Matt Gardner (Allen Institute for Artificial Intelligence), Sameer Singh (University of California, Irvine). Contact: ericw@allenai.org, sameer@uci.edu
AB - WARNING: This paper contains model outputs which are offensive in nature. Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling)
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1221
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/universal-adversarial-triggers-for-attacking-and-analyzing-nlp-fhmsxuuZrF
DP - DeepDyve
ER -