This research paper describes a new way to evaluate how well large language models (LLMs) solve math problems. The researchers built a benchmark called LiveMathBench, which draws difficult problems from contests such as the Chinese National Mathematical Olympiad and the American Mathematics Competition. They also introduced a new metric, G-Pass@k, which measures not only whether an LLM can produce the right answer, but also how consistently it does so across multiple attempts. They found that even the strongest LLMs struggled to answer these hard problems reliably. The results suggest that simply making LLMs bigger does not guarantee better mathematical reasoning, and that new approaches are needed to teach models to solve problems consistently.
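To make the consistency idea concrete, here is a minimal Python sketch of a G-Pass@k-style score. It assumes the score is the hypergeometric probability that at least ⌈τ·k⌉ of k attempts, sampled from n total generations of which c were correct, are correct; the paper's exact formulation and parameter names (n, c, k, τ) may differ, so treat this as an illustration rather than the official definition.

```python
import math


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k attempts sampled
    without replacement from n total attempts (c of them correct)
    are correct -- a hypergeometric tail in the spirit of Pass@k.

    This is an illustrative sketch, not the paper's exact formula.
    """
    threshold = math.ceil(tau * k)
    return sum(
        math.comb(c, j) * math.comb(n - c, k - j) / math.comb(n, k)
        for j in range(threshold, k + 1)
    )


# Example: a model answers 12 of 16 generations correctly.
# With a low bar (tau ~ 1/k, i.e. "at least one correct") the score is high,
# but requiring 75% of the 8 sampled attempts to be correct (tau = 0.75)
# gives a lower score, exposing unstable reasoning.
print(round(g_pass_at_k(n=16, c=12, k=8, tau=1 / 8), 3))
print(round(g_pass_at_k(n=16, c=12, k=8, tau=0.75), 3))
```

The contrast between the two printed values is the point of the metric: a model can look strong when only one lucky success is required, yet score poorly when it must be right most of the time.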