TY - JOUR
TI - Universal Adversarial Triggers for Attacking and Analyzing NLP
AU - Wallace, Eric
AU - Feng, Shi
AU - Kandpal, Nikhil
AU - Gardner, Matt
AU - Singh, Sameer
N1 - Affiliations: Eric Wallace (Allen Institute for Artificial Intelligence), Shi Feng (University of Maryland), Nikhil Kandpal (Independent Researcher), Matt Gardner (Allen Institute for Artificial Intelligence), Sameer Singh (University of California, Irvine). Contact: ericw@allenai.org, sameer@uci.edu
AB - WARNING: This paper contains model outputs which are offensive in nature. Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling)
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1221
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/universal-adversarial-triggers-for-attacking-and-analyzing-nlp-fhmsxuuZrF
DP - DeepDyve
ER -