Automated Question-Answer Generation for Evaluating RAG-based Chatbots

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

Abstract

In this research, we propose a framework that automatically generates human-like question-answer pairs with long or factoid answers and uses them to automatically evaluate the quality of Retrieval-Augmented Generation (RAG). The framework can also create datasets that assess the hallucination levels of Large Language Models (LLMs) by simulating unanswerable questions. We apply the framework to create a dataset of question-answer (QA) pairs based on more than 1,000 leaflets about the medical and administrative procedures of a hospital. The dataset was evaluated by hospital specialists, who confirmed that more than 50% of the QA pairs are applicable. Finally, we show that the framework can be used to evaluate LLM performance by testing Llama-2-13B fine-tuned in Dutch (Vanroy, 2023) on the generated dataset; the method appears promising for testing models on unanswerable and factoid questions.

Original language: English
Title of host publication: 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024 at LREC-COLING 2024 - Workshop Proceedings
Editors: Dina Demner-Fushman, Sophia Ananiadou, Paul Thompson, Brian Ondov
Publisher: European Language Resources Association (ELRA)
Pages: 204-214
Number of pages: 11
ISBN (Electronic): 9782493814258
Publication status: Published - 2024
Event: 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024 - Torino, Italy
Duration: 20 May 2024 → 20 May 2024
Conference number: 1

Workshop

Workshop: 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 → 20/05/24

Keywords

  • Chatbot Evaluation
  • Hallucination Detection
  • LLMs
  • Retrieval Augmented Generation
