Peter Wildeford, from a post on GPT-5:
Benchmark | Quick explanation | Quote from the article | Link | Order in the article |
---|---|---|---|---|
BIG-Bench Hard | A diverse set of questions — commonsense reasoning, social understanding, physics — specifically chosen to be challenging for LLMs. | “AI models have steadily improved at answering diverse questions—from commonsense reasoning to understanding social situations and physics. This is demonstrated on the ‘BIG-Bench Hard’ benchmark, which features diverse questions specifically chosen to challenge LLMs.” | https://github.com/google/BIG-bench | 1 |
ImageNet | Standard benchmark for image recognition accuracy; the compute-efficiency trend quoted here is sketched just after the table. | “Every two years, the compute needed to get the same performance across a wide range of models has decreased tenfold.” (based on the ImageNet benchmark) | https://www.image-net.org/ | 2 |
GPQA Diamond | PhD-level science questions designed to be “Google-proof,” so only domain experts can answer. | “Consider the GPQA Diamond benchmark — a set of scientific questions designed so that people with PhDs in the field can mostly answer them, but non-experts can’t, even with 30 minutes of access to Google.” | https://arxiv.org/abs/2311.12022 | 3 |
FrontierMath | Extremely difficult math problems, from Olympiad-level up to challenges for professional mathematicians. | “Epoch AI created Frontier Math — a benchmark of insanely hard mathematical problems. The easiest 25% are similar to Olympiad-level problems. The most difficult 25% are, according to Fields Medalist Terence Tao, ‘extremely challenging,’ and would typically need an expert in that branch of mathematics to solve them.” | https://arxiv.org/abs/2411.04872 | 4 |
SWE-bench Verified | Real-world software engineering tasks drawn from GitHub issues, each typically taking a human engineer about an hour. | “SWE-bench Verified is a benchmark of real-world software engineering problems from GitHub that typically take about an hour to complete.” | https://www.swebench.com/ | 5 |
RE-Bench | Seven difficult AI research engineering tasks (e.g., fine-tuning, predicting experiments) designed to mimic real AI R&D. | “Now consider perhaps the world’s most important benchmark: METR’s set of difficult AI research engineering problems (‘RE Bench’).” | https://arxiv.org/abs/2411.15114 | 6 |
METR Time Horizon benchmark | Categorises computer-use tasks by how long they take humans (seconds → weeks), measuring how long models can sustain performance on a task. | “METR made a broader benchmark of computer use tasks categorised by time horizon. GPT-2 was only able to do tasks that took humans a few seconds; GPT-4 managed a few minutes; and the latest reasoning models could do tasks that took humans just under an hour.” | https://metr.org/ | 7 |
MMLU | Compilation of college and professional knowledge tests covering multiple subjects. | “MMLU: compilation of college and professional knowledge tests.” | https://arxiv.org/abs/2009.03300 | 8 |
Humanity’s Last Exam | 3,000 extremely hard questions at the frontier of human knowledge. | “Humanity’s last exam: a compilation of 3,000 even harder questions at the frontier of human knowledge.” | https://lastexam.ai/ | 9 |
MATH | High school mathematics competition questions. | “MATH: High school math competition questions.” | https://arxiv.org/abs/2103.03874 | 10 |
Situational Awareness benchmark | Tests whether a model understands itself, its outputs, and its deployment/training status. | “Situational Awareness: questions designed to test if model understands itself and context.” | (No standalone link found — likely internal to METR research) | 11 |
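
To make the ImageNet row's efficiency claim concrete, here is a minimal sketch of the quoted trend, assuming a steady tenfold reduction in required compute every two years. Only that rate comes from the article; the function name, baseline value, and years shown are illustrative.

```python
# Minimal sketch of the trend quoted in the ImageNet row:
# "Every two years, the compute needed to get the same performance ... has decreased tenfold."
# Only the 10x-per-2-years rate comes from the article; everything else here is illustrative.

def compute_needed(initial_compute: float, years_elapsed: float) -> float:
    """Compute required for the same performance after `years_elapsed` years,
    assuming a steady tenfold reduction every two years."""
    return initial_compute * 10 ** (-years_elapsed / 2)

if __name__ == "__main__":
    baseline = 1.0  # arbitrary compute units at year 0
    for years in (0, 2, 4, 6):
        print(f"after {years} years: {compute_needed(baseline, years):.3f}x the original compute")
```

Under that assumption, matching the year-0 result takes 0.1x the compute after two years, 0.01x after four, and 0.001x after six.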
Lech Mazur gives us the Short Story Creative Writing benchmark, where GPT-5-Thinking comes out on top. I continue not to trust the grading on writing, but it's not meaningless.