TY - JOUR
AU - Ghazvininejad, Marjan
AU - Levy, Omer
AU - Liu, Yinhan
AU - Zettlemoyer, Luke
AD - Facebook AI Research, Seattle, WA
TI - Mask-Predict: Parallel Decoding of Conditional Masked Language Models
AB - Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applying this strategy for a constant number of iterations, our model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average. It is also able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1633
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/mask-predict-parallel-decoding-of-conditional-masked-language-models-90mkZBjvyr
DP - DeepDyve
ER -