TY - JOUR
AU - Sellam, Thibault
AU - Das, Dipanjan
AU - Parikh, Ankur P.
AD - Google Research, New York, NY ({tsellam, dipanjand, aparikh}@google.com)
TI - BLEURT: Learning Robust Metrics for Text Generation
AB - Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset.
JF - Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
DO - 10.18653/v1/2020.acl-main.704
DA - 2020-01-01
UR - https://www.deepdyve.com/lp/unpaywall/bleurt-learning-robust-metrics-for-text-generation-cZaUoHISgR
DP - DeepDyve
ER - 