TY - JOUR
AU - Sellam, Thibault
AU - Das, Dipanjan
AU - Parikh, Ankur P.
AD - Google Research, New York, NY ({tsellam, dipanjand, aparikh}@google.com)
TI - BLEURT: Learning Robust Metrics for Text Generation
AB - Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset.
JF - Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
DO - 10.18653/v1/2020.acl-main.704
DA - 2020-01-01
UR - https://www.deepdyve.com/lp/unpaywall/bleurt-learning-robust-metrics-for-text-generation-cZaUoHISgR
DP - DeepDyve
ER - 