TY - JOUR
AU - Abend, Omri
AB - Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted Q2, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of Q2 against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.
TI - Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
JF - Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DO - 10.18653/v1/2021.emnlp-main.619
DA - 2021-01-01
UR - https://www.deepdyve.com/lp/unpaywall/q2-evaluating-factual-consistency-in-knowledge-grounded-dialogues-via-2E6aFo8lgN
DP - DeepDyve
ER -