TY - JOUR
AU - Wang, Alex
AU - Singh, Amanpreet
AU - Michael, Julian
AU - Hill, Felix
AU - Levy, Omer
AU - Bowman, Samuel R.
TI - GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
AB - Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of
N1 - Table 1 of the paper lists the nine GLUE tasks by category: single-sentence tasks (CoLA, 8.5k train, acceptability, misc.; SST-2, 67k, sentiment, movie reviews); similarity and paraphrase tasks (MRPC, 3.7k, paraphrase, news; STS-B, 7k, textual similarity, misc.; QQP, 364k, paraphrase, online QA); inference tasks (MNLI, 393k, NLI, misc.; QNLI, 108k, QA/NLI, Wikipedia; RTE, 2.5k, NLI, misc.; WNLI, 634, coreference/NLI, fiction books).
JF - Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
DO - 10.18653/v1/w18-5446
DA - 2018-01-01
UR - https://www.deepdyve.com/lp/unpaywall/glue-a-multi-task-benchmark-and-analysis-platform-for-natural-language-jecjNTxtFT
DP - DeepDyve
ER -