This notebook describes FACTS Grounding, a benchmark that measures how well large language models (LLMs) give accurate answers grounded in long input documents. FACTS Grounding uses a collection of human-authored prompts, each pairing a document with a user request, to challenge LLMs. Other LLMs then act as judges, scoring whether a response is factually supported by the document and whether it follows the instructions in the request. The goal is to assess how well LLMs can understand and use information from long texts without hallucinating or ignoring what was asked. The researchers found that aggregating over multiple LLM judges is important because individual judges tend to be biased toward answers from their own model family. FACTS Grounding will be continuously updated with new models, helping researchers improve the factual accuracy and reliability of LLMs.
https://storage.googleapis.com/deepmind-media/FACTS/FACTS_grounding_paper.pdf
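
As a rough illustration of the judging step described above, the sketch below aggregates per-response grounding verdicts from several judge models and averages them into a single score, which dampens any single judge's bias toward its own answers. The judge names, the `call_judge` helper, and the prompt template are placeholders for illustration, not the paper's actual implementation.

```python
from statistics import mean
from typing import Callable

# Hypothetical judge identifiers; the real benchmark uses its own panel of judge LLMs.
JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]

# Simplified judging prompt (assumption): the judge labels a response GROUNDED
# only if it both fulfills the request and is fully supported by the document.
JUDGE_PROMPT = (
    "You are given a source document, a user request, and a model response.\n"
    "Answer GROUNDED if every claim in the response is supported by the document\n"
    "and the response fulfills the request; otherwise answer UNGROUNDED.\n\n"
    "Document:\n{document}\n\nRequest:\n{request}\n\nResponse:\n{response}\n"
)

def judge_response(
    call_judge: Callable[[str, str], str],  # (judge_model, prompt) -> raw judge output
    document: str,
    request: str,
    response: str,
) -> float:
    """Score one response as the fraction of judges that label it grounded."""
    prompt = JUDGE_PROMPT.format(document=document, request=request, response=response)
    verdicts = []
    for judge in JUDGE_MODELS:
        raw = call_judge(judge, prompt)
        # Treat the response as grounded unless the judge explicitly says UNGROUNDED.
        verdicts.append(0.0 if "UNGROUNDED" in raw.upper() else 1.0)
    # Averaging across judges reduces any single judge's self-preference bias.
    return mean(verdicts)

def factuality_score(
    call_judge: Callable[[str, str], str],
    examples: list[dict],  # each with "document", "request", and "response" keys
) -> float:
    """Mean per-example grounding score over the evaluation set."""
    return mean(
        judge_response(call_judge, ex["document"], ex["request"], ex["response"])
        for ex in examples
    )
```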