Peter Wildeford, from a post on GPT-5

From "The case for AGI by 2030"

| Benchmark | Quick explanation | Quote from the article | Link | Order in article |
|---|---|---|---|---|
| BIG-Bench Hard | A diverse set of questions (commonsense reasoning, social understanding, physics) specifically chosen to be challenging for LLMs. | “AI models have steadily improved at answering diverse questions—from commonsense reasoning to understanding social situations and physics. This is demonstrated on the ‘BIG-Bench Hard’ benchmark, which features diverse questions specifically chosen to challenge LLMs.” | https://github.com/google/BIG-bench | 1 |
| ImageNet | Standard benchmark for image recognition accuracy; the quoted compute trend is illustrated in the sketch below the table. | “Every two years, the compute needed to get the same performance across a wide range of models has decreased tenfold.” (based on the ImageNet benchmark) | https://www.image-net.org/ | 2 |
| GPQA Diamond | PhD-level science questions designed to be “Google-proof,” so only domain experts can answer them. | “Consider the GPQA Diamond benchmark — a set of scientific questions designed so that people with PhDs in the field can mostly answer them, but non-experts can’t, even with 30 minutes of access to Google.” | https://arxiv.org/abs/2311.12022 | 3 |
| FrontierMath | Extremely difficult math problems, from Olympiad level up to challenges for professional mathematicians. | “Epoch AI created Frontier Math — a benchmark of insanely hard mathematical problems. The easiest 25% are similar to Olympiad-level problems. The most difficult 25% are, according to Fields Medalist Terence Tao, ‘extremely challenging,’ and would typically need an expert in that branch of mathematics to solve them.” | https://arxiv.org/abs/2411.04872 | 4 |
| SWE-bench Verified | Real-world software engineering tasks from GitHub, typically about an hour long, testing work across multiple applications. | “SWE-bench Verified is a benchmark of real-world software engineering problems from GitHub that typically take about an hour to complete.” | https://www.swebench.com/ | 5 |
| RE-Bench | Seven difficult AI research engineering tasks (e.g., fine-tuning models, predicting experiment results) designed to mimic real AI R&D. | “Now consider perhaps the world’s most important benchmark: METR’s set of difficult AI research engineering problems (‘RE Bench’).” | https://arxiv.org/abs/2411.15114 | 6 |
| METR Time Horizon benchmark | Categorises AI/computer-use tasks by the time they take humans (seconds to weeks) to measure how long a model can sustain performance. | “METR made a broader benchmark of computer use tasks categorised by time horizon. GPT-2 was only able to do tasks that took humans a few seconds; GPT-4 managed a few minutes; and the latest reasoning models could do tasks that took humans just under an hour.” | https://metr.org/ | 7 |
| MMLU | Compilation of college and professional knowledge tests covering multiple subjects. | “MMLU: compilation of college and professional knowledge tests.” | https://arxiv.org/abs/2009.03300 | 8 |
| Humanity’s Last Exam | 3,000 extremely hard questions at the frontier of human knowledge. | “Humanity’s last exam: a compilation of 3,000 even harder questions at the frontier of human knowledge.” | No official link found (possibly an internal/unnamed dataset) | 9 |
| MATH | High school mathematics competition questions. | “MATH: High school math competition questions.” | https://arxiv.org/abs/2103.03874 | 10 |
| Situational Awareness benchmark | Tests whether a model understands itself, its outputs, and its deployment/training status. | “Situational Awareness: questions designed to test if model understands itself and context.” | No standalone link found (likely internal to METR research) | 11 |
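The ImageNet row quotes a tenfold drop, every two years, in the compute needed to reach the same performance. As a quick illustration of what that rate implies if it simply compounds (my own arithmetic, not a figure from the article), here is a minimal sketch:

```python
def efficiency_gain(years: float, tenfold_every_years: float = 2.0) -> float:
    """Multiplier implied by the quoted trend: 10x less compute for the
    same ImageNet performance every `tenfold_every_years` years."""
    return 10 ** (years / tenfold_every_years)

# Naive compounding of the quoted rate (assumes the trend holds steadily):
for years in (2, 4, 6, 8):
    print(f"after {years} years: ~{efficiency_gain(years):,.0f}x less compute for the same accuracy")
```

Under that assumption, eight years of the measured trend would mean roughly 10,000x less compute for the same accuracy; treat this as an extrapolation of a historical measurement, not a forecast.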

Creative writing

Lech Mazur gives us the Short Story Creative Writing benchmark, where GPT-5-Thinking comes out on top. I continue not to trust the grading on writing, but it's not meaningless.

[Image: Short Story Creative Writing benchmark results]
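For context on the grading concern: creative writing leaderboards like this typically score each story with LLM judges against a rubric and average the results. The sketch below is a generic illustration of that pattern, not Lech Mazur's actual pipeline; `LLMClient`, `RUBRIC`, and `judge_story` are names I made up, and the judge callables are placeholders for whatever model API you would use.

```python
import statistics
from typing import Callable

# Placeholder for an LLM call: takes a prompt string, returns the model's text reply.
LLMClient = Callable[[str], str]

# Hypothetical rubric prompt; real benchmarks use far more detailed criteria.
RUBRIC = (
    "Score the short story from 1 to 10 on originality, coherence, and prose quality. "
    "Reply with a single number only.\n\nStory:\n{story}"
)

def judge_story(story: str, judges: list[LLMClient]) -> float:
    """Average the scores that several judge models assign to one story."""
    scores = []
    for judge in judges:
        reply = judge(RUBRIC.format(story=story)).strip()
        try:
            scores.append(float(reply))
        except ValueError:
            continue  # judge ignored the "number only" instruction; skip its reply
    return statistics.mean(scores) if scores else float("nan")
```

The distrust comes from this setup: judge models can share stylistic preferences with the models being ranked, so high agreement among judges does not guarantee the ordering matches human readers' judgments.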