Abstract
In this research, we propose a framework that automatically generates human-like question-answer (QA) pairs with long or factoid answers and uses them to automatically evaluate the quality of Retrieval-Augmented Generation (RAG). The framework can also create datasets that assess the hallucination levels of Large Language Models (LLMs) by simulating unanswerable questions. We apply the framework to create a dataset of QA pairs based on more than 1,000 leaflets about the medical and administrative procedures of a hospital. The dataset was evaluated by hospital specialists, who confirmed that more than 50% of the QA pairs are applicable. Finally, we show that the framework can be used to evaluate LLM performance by testing Llama-2-13B fine-tuned for Dutch (Vanroy, 2023) on the generated dataset; the results suggest that the method is promising for testing how models handle unanswerable and factoid questions.
| Original language | English |
|---|---|
| Title of host publication | 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024 at LREC-COLING 2024 - Workshop Proceedings |
| Editors | Dina Demner-Fushman, Sophia Ananiadou, Paul Thompson, Brian Ondov |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 204-214 |
| Number of pages | 11 |
| ISBN (Electronic) | 9782493814258 |
| Publication status | Published - 2024 |
| Event | 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024 - Torino, Italy; Duration: 20 May 2024 → 20 May 2024; Conference number: 1 |
Workshop
| Workshop | 1st Workshop on Patient-Oriented Language Processing, CL4Health 2024 |
|---|---|
| Country/Territory | Italy |
| City | Torino |
| Period | 20/05/24 → 20/05/24 |
Keywords
- Chatbot Evaluation
- Hallucination Detection
- LLMs
- Retrieval Augmented Generation