TY - JOUR
AU - Chen, Yen-Chun
AU - Gan, Zhe
AU - Cheng, Yu
AU - Liu, Jingzhou
AU - Liu, Jingjing
AD - Microsoft Dynamics 365 AI Research
AD - Carnegie Mellon University
TI - Distilling Knowledge Learned in BERT for Text Generation
JF - Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
DO - 10.18653/v1/2020.acl-main.705
DA - 2020-01-01
UR - https://www.deepdyve.com/lp/unpaywall/distilling-knowledge-learned-in-bert-for-text-generation-UL0t0BSGop
DP - DeepDyve
AB - Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation [...]
N1 - First-page excerpt: However, beyond common practice of finetuning BERT for language understanding (Wang et al., 2019), applying BERT to language generation still remains an open question. Text generation aims to generate natural language sentences conditioned on certain input, with applications ranging from machine translation (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), text summarization (Nallapati et al., 2016; Gehring et al., 2017; Chen and Bansal, 2018), to image captioning (Vinyals et al., 2015; Xu et al., 2015; Gan et al., 2017). In this work, we study how to use BERT [...]
ER -
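
The approach named in the abstract, a C-MLM-finetuned BERT teacher supplying soft token-level supervision to a conventional Seq2Seq student, amounts to a knowledge-distillation objective. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the function name, the alpha/temperature hyperparameters, and the toy tensors are illustrative assumptions.

# Minimal sketch (not the authors' code): train a Seq2Seq student with the
# usual cross-entropy loss plus a soft-label term that follows the teacher's
# per-token distributions. All names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      pad_id=0, alpha=0.5, temperature=1.0):
    """student_logits, teacher_logits: (batch, seq_len, vocab)
    target_ids: (batch, seq_len) ground-truth target tokens."""
    vocab = student_logits.size(-1)
    # Hard-label term: standard maximum-likelihood training of the student.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         target_ids.view(-1), ignore_index=pad_id)
    # Soft-label term: KL(teacher || student) over the vocabulary at each
    # non-padding position, with optional temperature smoothing.
    log_q = F.log_softmax(student_logits / temperature, dim=-1)
    p = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_q, p, reduction="none").sum(-1)   # (batch, seq_len)
    mask = (target_ids != pad_id).float()
    kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * ce + (1.0 - alpha) * kl

if __name__ == "__main__":
    # Toy usage: random logits stand in for the frozen teacher and the student.
    B, T, V = 2, 7, 100
    student_logits = torch.randn(B, T, V, requires_grad=True)
    teacher_logits = torch.randn(B, T, V)   # teacher outputs carry no gradient
    targets = torch.randint(1, V, (B, T))   # avoid pad_id=0 in the toy targets
    loss = distillation_loss(student_logits, teacher_logits, targets)
    loss.backward()
    print(float(loss))

In the paper's setting, the teacher distributions would come from the C-MLM-finetuned BERT predicting masked target positions; here random logits stand in for both models.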