TY - JOUR
AU - Huang, Haoyang
AU - Liang, Yaobo
AU - Duan, Nan
AU - Gong, Ming
AU - Shou, Linjun
AU - Jiang, Daxin
AU - Zhou, Ming
TI - Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
AB - We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages. Comparing to similar efforts such as Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019), three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model.
JF - Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DO - 10.18653/v1/d19-1252
DA - 2019-01-01
UR - https://www.deepdyve.com/lp/unpaywall/unicoder-a-universal-language-encoder-by-pre-training-with-multiple-CDTLS0ZGVi
DP - DeepDyve
ER -