In the fast-growing field of artificial intelligence, evaluating the answers generated by grounded question answering (GQA) systems is crucial. These systems, which answer questions based on reference documents, need precise and reliable evaluation tools to guarantee the quality of their answers. But given the complexity of these evaluations, can we trust large language models (LLMs) to judge these answers automatically?
GroUSE: An innovative evaluation dataset
📚 To answer this question, the teams at ILLUIN Technology developed GroUSE (Grounded QA Unitary Scoring of Evaluators). GroUSE is a meta-evaluation dataset designed to assess the ability of LLMs to judge the quality of answers produced by a RAG system. The benchmark is based on 144 carefully designed tests, each comprising 🖋️:
- A question
- A list of references
- A (potentially erroneous) answer
- Ratings on six criteria, including relevance, completeness and faithfulness to the references
This approach makes it possible to assess the ability of LLMs to make coherent and accurate judgments of responses in a variety of scenarios.
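The structure of a test described above can be sketched in Python as follows. This is only an illustrative sketch, not the format used by the GroUSE dataset or library: the class name `GroundedQATest`, the `passes` helper, the 1-to-5 rating scale and the idea of an accepted range per criterion are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedQATest:
    """Hypothetical unit test: what the judge sees, plus the ratings we expect."""
    question: str
    references: list[str]
    answer: str  # may deliberately contain an error
    # Assumed format: criterion -> inclusive (min, max) range on a 1-5 scale.
    expected: dict[str, tuple[int, int]] = field(default_factory=dict)


def passes(test: GroundedQATest, judge_ratings: dict[str, int]) -> bool:
    """Return True only if every expected criterion is rated within its range."""
    for criterion, (low, high) in test.expected.items():
        rating = judge_ratings.get(criterion)
        if rating is None or not (low <= rating <= high):
            return False
    return True


test = GroundedQATest(
    question="When was the Eiffel Tower completed?",
    references=["The Eiffel Tower was completed in 1889 for the World's Fair."],
    answer="It was completed in 1899.",  # wrong date: contradicts the reference
    expected={"relevance": (4, 5), "faithfulness": (1, 2)},
)

lenient_judge = {"relevance": 5, "faithfulness": 5}  # misses the wrong date
strict_judge = {"relevance": 5, "faithfulness": 1}   # penalizes the wrong date

print(passes(test, lenient_judge))  # False
print(passes(test, strict_judge))   # True
```

A judge that rates the unfaithful answer highly fails the test, which is the kind of behavior a meta-evaluation benchmark is meant to surface.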
The limits of automated evaluations
In RAG systems, it is common practice to use LLMs to evaluate responses automatically. However, these automatic evaluators are themselves prone to errors, such as hallucinations (stating information that is not in the source documents). Until now, human evaluation has been the benchmark for accuracy, but it does not scale to regular, large-scale evaluations.
That's where GroUSE comes in: it offers a way to test whether LLMs can truly substitute for human expertise in this crucial role 🤖
Results: GPT-4 and Llama-3 live up to expectations
Initial results on GroUSE are promising:
🎖️ GPT-4 stands out with an accuracy of 95%, approaching human performance of 98%.
📂 Among open-source models, Llama-3 (70B) emerges as the best, with a score of 79%.
These results show that LLMs can be powerful tools for evaluating RAG systems, although more needs to be done to improve the performance of open-source models. One of the avenues explored in the study is to fine-tune models on reasoning traces, thereby improving their performance as evaluators.
Why is GroUSE an essential tool?
One of the main challenges in evaluating RAG systems is the accuracy of the judgments made by LLMs. GroUSE sheds new light on this by testing these models on practical scenarios. Unlike conventional meta-evaluation methods, which rely on correlation with a strong evaluator, GroUSE offers a more nuanced and accurate assessment.
The results also show that correlation with a strong evaluator measures the relative preference between answers, while the GroUSE pass rate makes it possible to check how well judgments are calibrated on practical cases, which makes for a more robust assessment.
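The difference between the two measures can be illustrated with a small Python sketch. The numbers below are invented for the example and do not come from the study; the 1-to-5 scale, the accepted ranges and the helper names `pearson` and `pass_rate` are assumptions. The sketch shows a judge that ranks answers exactly like a reference evaluator, and so correlates perfectly with it, while still failing most calibration checks because its ratings are shifted upward.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pass_rate(ratings, expected_ranges):
    """Share of ratings that fall inside their inclusive (min, max) range."""
    passed = sum(low <= r <= high for r, (low, high) in zip(ratings, expected_ranges))
    return passed / len(ratings)


# Illustrative ratings for four answers on an assumed 1-5 scale.
reference_judge = [1, 2, 3, 4]
shifted_judge = [2, 3, 4, 5]  # same ranking, but one point too generous

# Assumed accepted range for each answer.
expected_ranges = [(1, 1), (1, 2), (3, 3), (4, 5)]

print(pearson(reference_judge, shifted_judge))      # 1.0
print(pass_rate(reference_judge, expected_ranges))  # 1.0
print(pass_rate(shifted_judge, expected_ranges))    # 0.25
```

Correlation alone would rate the shifted judge as perfect, whereas the pass rate reveals that it over-scores three of the four answers.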
Towards a more reliable future for RAG systems
The introduction of GroUSE by ILLUIN Technology represents a major step forward in the improvement of RAG systems and their evaluation. By providing a precise and rigorous framework, this benchmark makes it possible to measure and improve the reliability of automatic evaluators in practical contexts. With promising results for GPT-4 and Llama-3, GroUSE points the way to a future where LLMs could play a key role in the automated evaluation of AI systems.
Professionals working on customized GenAI systems, whether as part of ILLUIN Search or ILLUIN Dialogue, will find GroUSE an invaluable tool for optimizing the quality of their question answering systems.
Thanks
👏 A big bravo to all the contributors: Sacha Muller, António Loison, Bilel Omrani and Gautier Viaud!
Find out more about GroUSE
See the following links:
📝 The full research paper: arxiv.org/abs/2409.06595
🗞️ The detailed blog post: huggingface.co/spaces/illuin/grouse
🐙 Source code: github.com/illuin-tech/grouse
📚 GroUSE dataset: huggingface.co/datasets/illuin/grouse













