TY - CONF
AU - Wang, Qiang
AU - Li, Bei
AU - Xiao, Tong
AU - Zhu, Jingbo
AU - Li, Changliang
AU - Wong, Derek F.
AU - Chao, Lidia S.
TI - Learning Deep Transformer Models for Machine Translation
AB - Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we …
JF - Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
DO - 10.18653/v1/p19-1176
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/learning-deep-transformer-models-for-machine-translation-2LzsckByOZ
DP - DeepDyve
ER -